Red Hat Enterprise Linux AI / RHELAI-4654

[certification] Evaluation with MMLU benchmark fails


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major

      To Reproduce

      Steps to reproduce the behavior:

      1. Install the RHEL AI 1.5 NVIDIA image on bare metal
      2. Run single-phase training
      3. Run the MMLU evaluation on any of the trained models:
         ilab model evaluate --benchmark mmlu --model /var/home/rhcert/.local/share/instructlab/checkpoints/hf_format/samples_2507
      4. See the error

      Expected behavior

      • Test should pass

      Logs

      Device Info (please complete the following information):

      • Hardware Specs: INTEL(R) XEON(R) GOLD 6548Y+ 4xL40s
      • OS Version: RHEL AI 1.5 NVIDIA
      • InstructLab Version: 0.26.1
      • Output of the two requested commands:
        • registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.5 9.20250429.0
        •
          ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
          ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
          ggml_cuda_init: found 4 CUDA devices:
            Device 0: NVIDIA L40S, compute capability 8.9, VMM: yes
            Device 1: NVIDIA L40S, compute capability 8.9, VMM: yes
            Device 2: NVIDIA L40S, compute capability 8.9, VMM: yes
            Device 3: NVIDIA L40S, compute capability 8.9, VMM: yes
          Platform:
            sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
            sys.platform: linux
            os.name: posix
            platform.release: 5.14.0-427.65.1.el9_4.x86_64
            platform.machine: x86_64
            platform.node: nvd-srv-28.nvidia.eng.rdu2.dc.redhat.com
            platform.python_version: 3.11.7
            os-release.ID: rhel
            os-release.VERSION_ID: 9.4
            os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
            memory.total: 251.39 GB
            memory.available: 239.33 GB
            memory.used: 10.26 GB
          InstructLab:
            instructlab.version: 0.26.1
            instructlab-dolomite.version: 0.2.0
            instructlab-eval.version: 0.5.1
            instructlab-quantize.version: 0.1.0
            instructlab-schema.version: 0.4.2
            instructlab-sdg.version: 0.8.2
            instructlab-training.version: 0.10.2
          Torch:
            torch.version: 2.6.0
            torch.backends.cpu.capability: AVX512
            torch.version.cuda: 12.4
            torch.version.hip: None
            torch.cuda.available: True
            torch.backends.cuda.is_built: True
            torch.backends.mps.is_built: False
            torch.backends.mps.is_available: False
            torch.cuda.bf16: True
            torch.cuda.current.device: 0
            torch.cuda.0.name: NVIDIA L40S
            torch.cuda.0.free: 43.9 GB
            torch.cuda.0.total: 44.3 GB
            torch.cuda.0.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.1.name: NVIDIA L40S
            torch.cuda.1.free: 43.9 GB
            torch.cuda.1.total: 44.3 GB
            torch.cuda.1.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.2.name: NVIDIA L40S
            torch.cuda.2.free: 43.9 GB
            torch.cuda.2.total: 44.3 GB
            torch.cuda.2.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.3.name: NVIDIA L40S
            torch.cuda.3.free: 43.9 GB
            torch.cuda.3.total: 44.3 GB
            torch.cuda.3.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
          llama_cpp_python:
            llama_cpp_python.version: 0.3.6
            llama_cpp_python.supports_gpu_offload: True
      Bug impact

      • Preventing the certification test from passing

      Known workaround

      • None

      Additional context

      • Error:
        ERROR 2025-07-15 06:01:02,579 instructlab.cli.model.evaluate:313: An error occurred during evaluation: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/hails/mmlu_no_train/paths-info/b2e1ec9aa795adafe68e8e983248dbd4b52a1c60
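        The 429 indicates Hugging Face Hub rate limiting while the evaluation harness fetches metadata for the hails/mmlu_no_train dataset. A generic client-side mitigation (a hypothetical sketch, not part of ilab or instructlab-eval) is to retry the Hub call with exponential backoff:

        ```python
        import time

        def with_backoff(fetch, retries=5, base_delay=1.0, sleep=time.sleep):
            """Call fetch(), retrying on 429-style errors with exponential backoff."""
            for attempt in range(retries):
                try:
                    return fetch()
                except RuntimeError as err:  # stand-in for requests' HTTPError
                    if "429" not in str(err) or attempt == retries - 1:
                        raise
                    sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...

        # Demo: a fake Hub call that is rate-limited twice, then succeeds.
        calls = {"n": 0}
        def fake_paths_info():
            calls["n"] += 1
            if calls["n"] < 3:
                raise RuntimeError("429 Client Error: Too Many Requests")
            return "paths-info"
        ```

        Another avenue, assuming the host has network access before the run, would be to pre-populate the local Hugging Face cache (e.g. with huggingface_hub's snapshot_download for the dataset repo) and rerun the evaluation with HF_HUB_OFFLINE=1 so no Hub API request is made at all.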

              Assignee: Unassigned
              Reporter: Aman Turate (rh-ee-aturate)
              Courtney Pacheco, Oleg Silkin