- Bug
- Resolution: Unresolved
- Major
To Reproduce
Steps to reproduce the behavior:
- Install the RHEL AI 1.5 NVIDIA image on bare metal
- Run single-phase training
- Run the MMLU evaluation on any of the trained models
- Command: ilab model evaluate --benchmark mmlu --model /var/home/rhcert/.local/share/instructlab/checkpoints/hf_format/samples_2507
- See error
Expected behavior
- The test should pass
Logs
Device Info (please complete the following information):
- Hardware Specs: Intel(R) Xeon(R) Gold 6548Y+, 4x NVIDIA L40S
- OS Version: RHEL AI 1.5 NVIDIA
- InstructLab Version: 0.26.1
- Bootc image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.5 9.20250429.0
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 4 CUDA devices:
  Device 0: NVIDIA L40S, compute capability 8.9, VMM: yes
  Device 1: NVIDIA L40S, compute capability 8.9, VMM: yes
  Device 2: NVIDIA L40S, compute capability 8.9, VMM: yes
  Device 3: NVIDIA L40S, compute capability 8.9, VMM: yes
Platform:
  sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.65.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: nvd-srv-28.nvidia.eng.rdu2.dc.redhat.com
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 251.39 GB
  memory.available: 239.33 GB
  memory.used: 10.26 GB
InstructLab:
  instructlab.version: 0.26.1
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.8.2
  instructlab-training.version: 0.10.2
Torch:
  torch.version: 2.6.0
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: 12.4
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA L40S
  torch.cuda.0.free: 43.9 GB
  torch.cuda.0.total: 44.3 GB
  torch.cuda.0.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: NVIDIA L40S
  torch.cuda.1.free: 43.9 GB
  torch.cuda.1.total: 44.3 GB
  torch.cuda.1.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: NVIDIA L40S
  torch.cuda.2.free: 43.9 GB
  torch.cuda.2.total: 44.3 GB
  torch.cuda.2.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: NVIDIA L40S
  torch.cuda.3.free: 43.9 GB
  torch.cuda.3.total: 44.3 GB
  torch.cuda.3.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
  llama_cpp_python.version: 0.3.6
  llama_cpp_python.supports_gpu_offload: True
Bug impact
- Prevents the certification test from passing
Known workaround
- None
Additional context
- Error:
ERROR 2025-07-15 06:01:02,579 instructlab.cli.model.evaluate:313: An error occurred during evaluation: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/hails/mmlu_no_train/paths-info/b2e1ec9aa795adafe68e8e983248dbd4b52a1c60
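The 429 response indicates Hugging Face Hub rate limiting while the evaluator resolves metadata for the hails/mmlu_no_train dataset. One mitigation that may apply (unverified on RHEL AI; `with_backoff`, `RateLimitError`, and `fake_fetch` below are hypothetical names, not part of InstructLab) is to retry the fetch with exponential backoff instead of failing on the first 429, sketched here:

```python
import random
import time


class RateLimitError(Exception):
    """Stand-in for the HTTP 429 'Too Many Requests' error seen in the log."""


def with_backoff(fn, max_retries=5, base_delay=1.0):
    """Call fn(), retrying on rate-limit errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Sleep base_delay * 2^attempt seconds, plus up to 0.5 s of jitter.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))


# Demo: a fake fetch that fails twice with 429 before succeeding.
calls = {"n": 0}

def fake_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("429 Too Many Requests")
    return "dataset metadata"

print(with_backoff(fake_fetch, base_delay=0.01))  # prints "dataset metadata" after 2 retries
```

A complementary approach would be to pre-populate the local datasets cache on a machine with unthrottled access, so the evaluation run never hits the Hub at all.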