- Bug
- Resolution: Unresolved
- rhelai-1.5
- Sprint 1, Sprint 2
To Reproduce
Steps to reproduce the behavior:
- After completion of multi-phase training, run MT-Bench evaluation of the resulting checkpoint (the full command is broken out below this list).
- Command: ilab model evaluate --benchmark mt_bench --judge-model /var/home/rhcert/.cache/instructlab/models/prometheus-8x7b-v2-0 --model /var/home/rhcert/.local/share/instructlab/checkpoints/hf_format/samples_2394 --gpus 8
- Error: An error occurred during evaluation: Failed to start server: vLLM failed to start up in 397.2 seconds
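For readability, the same command from this report split across lines (the model and judge-model paths are specific to this system and will differ elsewhere):

ilab model evaluate \
  --benchmark mt_bench \
  --judge-model /var/home/rhcert/.cache/instructlab/models/prometheus-8x7b-v2-0 \
  --model /var/home/rhcert/.local/share/instructlab/checkpoints/hf_format/samples_2394 \
  --gpus 8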
Expected behavior
- MT-Bench evaluation should complete successfully and report scores
Screenshots
- Logs attached
Device Info (please complete the following information):
- Hardware Specs: Azure ND MI300x v5
- OS Version: registry.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5
- InstructLab Version: 0.26.1
- Output of sudo bootc status --format json | jq .status.booted.image.image.image (bootc image name and tag):
registry.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5
- Output of ilab system info:
Platform:
sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
sys.platform: linux
os.name: posix
platform.release: 5.14.0-427.65.1.el9_4.x86_64
platform.machine: x86_64
platform.node: hwcert-rhelai-amd-v2-1
platform.python_version: 3.11.7
os-release.ID: rhel
os-release.VERSION_ID: 9.4
os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
memory.total: 1819.96 GB
memory.available: 1785.63 GB
memory.used: 28.70 GB
InstructLab:
instructlab.version: 0.26.1
instructlab-dolomite.version: 0.2.0
instructlab-eval.version: 0.5.1
instructlab-quantize.version: 0.1.0
instructlab-schema.version: 0.4.2
instructlab-sdg.version: 0.8.2
instructlab-training.version: 0.10.2
Torch:
torch.version: 2.6.0
torch.backends.cpu.capability: AVX512
torch.version.cuda: None
torch.version.hip: 6.3.42134-a9a80e791
torch.cuda.available: True
torch.backends.cuda.is_built: True
torch.backends.mps.is_built: False
torch.backends.mps.is_available: False
torch.cuda.bf16: True
torch.cuda.current.device: 0
torch.cuda.0.name: AMD Radeon Graphics
torch.cuda.0.free: 191.0 GB
torch.cuda.0.total: 191.5 GB
torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.1.name: AMD Radeon Graphics
torch.cuda.1.free: 191.0 GB
torch.cuda.1.total: 191.5 GB
torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.2.name: AMD Radeon Graphics
torch.cuda.2.free: 191.0 GB
torch.cuda.2.total: 191.5 GB
torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.3.name: AMD Radeon Graphics
torch.cuda.3.free: 191.0 GB
torch.cuda.3.total: 191.5 GB
torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.4.name: AMD Radeon Graphics
torch.cuda.4.free: 191.0 GB
torch.cuda.4.total: 191.5 GB
torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.5.name: AMD Radeon Graphics
torch.cuda.5.free: 191.0 GB
torch.cuda.5.total: 191.5 GB
torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.6.name: AMD Radeon Graphics
torch.cuda.6.free: 191.0 GB
torch.cuda.6.total: 191.5 GB
torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
torch.cuda.7.name: AMD Radeon Graphics
torch.cuda.7.free: 191.0 GB
torch.cuda.7.total: 191.5 GB
torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
llama_cpp_python.version: 0.3.6
llama_cpp_python.supports_gpu_offload: False
Bug impact
- Certification of the Azure instance (ND MI300X v5) is blocked pending this bug
Known workaround
- None known; a diagnostic sketch for collecting more information follows below
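As a minimal diagnostic sketch (an assumption, not a verified workaround): serve the same checkpoint directly so the vLLM startup output is visible in the foreground. The checkpoint path is the one reported above.

# Diagnostic sketch only; assumes the checkpoint path from this report.
# Serving the model directly should surface the vLLM startup errors that
# the evaluate run summarizes as a timeout.
ilab model serve --model-path /var/home/rhcert/.local/share/instructlab/checkpoints/hf_format/samples_2394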