Type: Bug
Resolution: Won't Do
To Reproduce
Steps to reproduce the behavior:
- On a model trained with `granite-3-1-8b-starter` as the base, run the evaluation command:
  ilab --config /var/mnt/inststg1/instructlab/config.yaml model evaluate --benchmark mmlu --model /var/mnt/inststg1/instructlab/job/job_out/artifacts/phase2/checkpoints/hf_format/samples_372430 --gpus 8 --batch-size 64 --few-shots 2 --enable-serving-output
In case it helps: the default few-shots value of `5` always caused MMLU to hard-error for us. After working with Oleg on some of those issues, we now run with `--few-shots 2`, and the failure is intermittent rather than constant. A direct probe of the serving endpoint (sketched below) can help confirm where the 500s originate.
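For anyone triaging this, here is a minimal probe against the OpenAI-compatible completions endpoint that the vLLM backend exposes. The port is the ephemeral one from the log below and changes on every run, and the prompt is arbitrary; both are assumptions for illustration. A 500 from this request would point at the serving backend rather than at lm-eval:

```python
# Sketch: send one completion request to the vLLM endpoint that
# `ilab model evaluate` spins up, to check whether the 500s come
# from the serving backend rather than from lm-eval itself.
import requests

# Ephemeral port taken from the failing run's log; substitute yours.
BASE_URL = "http://127.0.0.1:59873"
MODEL = "/var/mnt/inststg1/instructlab/job/job_out/artifacts/phase2/checkpoints/hf_format/samples_372430"

payload = {
    "model": MODEL,
    "prompt": "The capital of France is",  # arbitrary test prompt
    "max_tokens": 8,
    "temperature": 0.0,
}

resp = requests.post(f"{BASE_URL}/v1/completions", json=payload, timeout=60)
print(resp.status_code)
print(resp.text[:500])
```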
Expected behavior
- MMLU evaluation is expected to work as it did in previous versions (< 1.4).
Screenshots
Error from request:
WARNING 2025-03-02 10:04:38,337 lm-eval:374: API request failed with error message: Internal Server Error. Retrying...
Requesting API:   5% 2560/56168 [04:23<1:11:16, 12.53it/s]
WARNING 2025-03-02 10:04:43,661 lm-eval:374: API request failed with error message: Internal Server Error. Retrying...
INFO 2025-03-02 10:04:53,247 instructlab.model.backends.vllm:494: Waiting for GPU VRAM reclamation...
ERROR 2025-03-02 10:05:01,545 instructlab.cli.model.evaluate:272: An error occurred during evaluation: 500 Server Error: Internal Server Error for url: http://127.0.0.1:59873/v1/completions
Requesting API:   5% 2560/56168 [04:45<1:39:38, 8.97it/s]
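To quantify how often requests fail before the run aborts, a quick sketch; the log file path is hypothetical (wherever you capture the evaluate stderr) and the regex is based on the warning format in the excerpt above:

```python
# Count lm-eval API failures in a captured evaluation log.
import re
from pathlib import Path

log_text = Path("evaluate.log").read_text()  # hypothetical capture path
failures = re.findall(
    r"API request failed with error message: (.+?)\. Retrying", log_text
)
print(f"{len(failures)} failed API requests")
for msg in sorted(set(failures)):
    print(" -", msg)
```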
Device Info
- Hardware Specs:
  GPU: 8 x NVIDIA H100 80 GB
  Size: 160 vCPU | 1792 GiB | 200 Gbps
- OS Version: RHEL AI 1.4.1
- InstructLab Version: ilab, version 0.23.1
- Output of `sudo bootc status --format json | jq`:
  registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4
- Output of `ilab system info`:
bash-5.1# ilab system info
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 4: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 5: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 6: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Platform:
  sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.50.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: dev-rhel-ai-training-client-h100-2
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1763.83 GB
  memory.available: 1746.22 GB
  memory.used: 8.10 GB
InstructLab:
  instructlab.version: 0.23.1
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.7.0
  instructlab-training.version: 0.7.0
Torch:
  torch.version: 2.5.1
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: 12.4
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA H100 80GB HBM3
  torch.cuda.0.free: 78.6 GB
  torch.cuda.0.total: 79.1 GB
  torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: NVIDIA H100 80GB HBM3
  torch.cuda.1.free: 78.6 GB
  torch.cuda.1.total: 79.1 GB
  torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: NVIDIA H100 80GB HBM3
  torch.cuda.2.free: 78.6 GB
  torch.cuda.2.total: 79.1 GB
  torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: NVIDIA H100 80GB HBM3
  torch.cuda.3.free: 78.6 GB
  torch.cuda.3.total: 79.1 GB
  torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.4.name: NVIDIA H100 80GB HBM3
  torch.cuda.4.free: 78.6 GB
  torch.cuda.4.total: 79.1 GB
  torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.5.name: NVIDIA H100 80GB HBM3
  torch.cuda.5.free: 78.6 GB
  torch.cuda.5.total: 79.1 GB
  torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.6.name: NVIDIA H100 80GB HBM3
  torch.cuda.6.free: 78.6 GB
  torch.cuda.6.total: 79.1 GB
  torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.7.name: NVIDIA H100 80GB HBM3
  torch.cuda.7.free: 78.6 GB
  torch.cuda.7.total: 79.1 GB
  torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
  llama_cpp_python.version: 0.3.2
  llama_cpp_python.supports_gpu_offload: True
Bug impact
- MMLU evaluation results cannot be obtained for trained models, so trained checkpoints cannot be benchmarked.
Known workaround
- Partial only: the default few-shots value of `5` always caused MMLU to hard-error for us. After working with Oleg, we switched to `--few-shots 2`, which reduces the failure to intermittent. For unattended runs, a simple retry wrapper (sketched below) can ride out the intermittent failures.
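A minimal retry sketch around the exact command from the reproduction steps; the retry count of 3 is our choice, not anything `ilab` provides, and this is a stopgap rather than a fix:

```python
# Re-run the MMLU evaluation up to 3 times, since with --few-shots 2
# the failure is intermittent. Exits 0 on the first successful attempt.
import subprocess
import sys

CMD = [
    "ilab", "--config", "/var/mnt/inststg1/instructlab/config.yaml",
    "model", "evaluate",
    "--benchmark", "mmlu",
    "--model", "/var/mnt/inststg1/instructlab/job/job_out/artifacts/phase2/checkpoints/hf_format/samples_372430",
    "--gpus", "8",
    "--batch-size", "64",
    "--few-shots", "2",
    "--enable-serving-output",
]

result = None
for attempt in range(1, 4):
    print(f"evaluation attempt {attempt}", file=sys.stderr)
    result = subprocess.run(CMD)
    if result.returncode == 0:
        sys.exit(0)
sys.exit(result.returncode)
```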