Red Hat Enterprise Linux AI / RHELAI-4928

[certification] HPE DL384 Gen12 2xNVIDIA GH200 144G HBM3e


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Version: rhelai-1.5
    • Component: InstructLab - SDG

      To Reproduce

      Steps to reproduce the behavior:

      1. Configure the system as per the attached config (ilab_config.txt)
      2. Run `ilab config init`
      3. Start the SDG process: `ilab data generate --pipeline full --gpus 2`
      4. Observe the error:
        INFO 2025-08-01 15:01:07,676 instructlab.model.backends.vllm:148: Gave up waiting for vLLM server to start at http://127.0.0.1:48703/v1 after 1200 attempts
        INFO 2025-08-01 15:01:12,795 instructlab.model.backends.vllm:512: Waiting for GPU VRAM reclamation...
        failed to generate data with exception: Failed to start server: vLLM failed to start up in 3479.3 seconds 
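
      As a back-of-the-envelope check (assuming the 1200 attempts span the full 3479.3 s reported above), the startup poll interval works out to roughly 2.9 s per attempt:

```python
# Rough check of the vLLM startup polling cadence implied by the log
# (assumption: the 1200 attempts cover the full 3479.3 s wait).
total_wait_s = 3479.3   # from "vLLM failed to start up in 3479.3 seconds"
attempts = 1200         # from "after 1200 attempts"

interval_s = total_wait_s / attempts
print(f"~{interval_s:.1f} s per attempt")  # ~2.9 s per attempt
```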

      Expected behavior

      • The SDG run should complete without error

      Device Info (please complete the following information):

      • Hardware Specs: HPE DL384 Gen12 2xNVIDIA GH200 144G HBM3e
      • OS Version: RHEL AI 1.5
      • InstructLab Version: [instructlab.version: 0.26.1]
      • bootc image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.5
      • Output of `ilab system info`:
          ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
          ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
          ggml_cuda_init: found 2 CUDA devices:
            Device 0: NVIDIA GH200 144G HBM3e, compute capability 9.0, VMM: yes
            Device 1: NVIDIA GH200 144G HBM3e, compute capability 9.0, VMM: yes
          Platform:
            sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
            sys.platform: linux
            os.name: posix
            platform.release: 5.14.0-427.65.1.el9_4.aarch64
            platform.machine: aarch64
            platform.node: dl384rhelai
            platform.python_version: 3.11.7
            os-release.ID: rhel
            os-release.VERSION_ID: 9.4
            os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
            memory.total: 1227.71 GB
            memory.available: 1213.35 GB
            memory.used: 8.49 GB
          
          InstructLab:
            instructlab.version: 0.26.1
            instructlab-dolomite.version: 0.2.0
            instructlab-eval.version: 0.5.1
            instructlab-quantize.version: 0.1.0
            instructlab-schema.version: 0.4.2
            instructlab-sdg.version: 0.8.2
            instructlab-training.version: 0.10.2
          
          Torch:
            torch.version: 2.6.0
            torch.backends.cpu.capability: DEFAULT
            torch.version.cuda: 12.4
            torch.version.hip: None
            torch.cuda.available: True
            torch.backends.cuda.is_built: True
            torch.backends.mps.is_built: False
            torch.backends.mps.is_available: False
            torch.cuda.bf16: True
            torch.cuda.current.device: 0
            torch.cuda.0.name: NVIDIA GH200 144G HBM3e
            torch.cuda.0.free: 142.1 GB
            torch.cuda.0.total: 142.6 GB
            torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.1.name: NVIDIA GH200 144G HBM3e
            torch.cuda.1.free: 142.1 GB
            torch.cuda.1.total: 142.6 GB
            torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
          
          llama_cpp_python:
            llama_cpp_python.version: 0.3.6
            llama_cpp_python.supports_gpu_offload: True 

      Bug impact

      • Validation efforts are blocked

      Known workaround

      • None
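
      For triage, the base URL that vLLM was expected to serve can be pulled out of the failure line and probed manually (e.g. with curl against `<base_url>/models`). A minimal sketch, assuming the log line format shown above; note the port (48703 here) is chosen at runtime and will differ between runs:

```python
import re

# Failure line copied verbatim from the report above.
line = ("INFO 2025-08-01 15:01:07,676 instructlab.model.backends.vllm:148: "
        "Gave up waiting for vLLM server to start at "
        "http://127.0.0.1:48703/v1 after 1200 attempts")

# Extract the base URL and the attempt count from the message.
m = re.search(r"to start at (https?://\S+) after (\d+) attempts", line)
base_url, attempts = m.group(1), int(m.group(2))
print(base_url, attempts)
```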

      Attachments

        1. ilab_config.txt (8 kB)
        2. ilab_serving (12).log (25 kB)
        3. SDG (8).log (424 kB)

      Assignee: Unassigned
      Reporter: Aman Turate (rh-ee-aturate)