  Red Hat Enterprise Linux AI
  RHELAI-3604

MMLU evaluation using the granite-3-1-8b-starter training base model fails intermittently on RHEL AI 1.4.1+


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Undefined
    • Component: Model Production

      To Reproduce

      Steps to reproduce the behavior:

      1. On a trained model with `granite-3-1-8b-starter` as the training base, run the evaluation command:
      ilab --config /var/mnt/inststg1/instructlab/config.yaml model evaluate --benchmark mmlu --model /var/mnt/inststg1/instructlab/job/job_out/artifacts/phase2/checkpoints/hf_format/samples_372430 --gpus 8 --batch-size 64 --few-shots 2 --enable-serving-output 

      In case it helps: the default few-shots value of `5` always caused MMLU to hard-error for us. After working with Oleg on some of these issues, we now run `--few-shots 2`, and the evaluation fails intermittently instead of every time.
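      Because the failure is intermittent, the easiest way to reproduce it is to re-run the evaluation several times and count failures. A minimal sketch, assuming the same config and checkpoint paths as above; the run count of 5 and the log paths are arbitrary:

      for i in $(seq 1 5); do
        ilab --config /var/mnt/inststg1/instructlab/config.yaml model evaluate \
          --benchmark mmlu \
          --model /var/mnt/inststg1/instructlab/job/job_out/artifacts/phase2/checkpoints/hf_format/samples_372430 \
          --gpus 8 --batch-size 64 --few-shots 2 --enable-serving-output \
          > "/tmp/mmlu_run_${i}.log" 2>&1 \
          && echo "run ${i}: ok" || echo "run ${i}: FAILED"
      done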

      Expected behavior

      • Expect MMLU evaluation to work as it did in previous versions (<1.4).

      Screenshots

       

      Error from request:

       

      WARNING 2025-03-02 10:04:38,337 lm-eval:374: API request failed with error message: Internal Server Error. Retrying...
      
      Requesting API:   5% 2560/56168 [04:23<1:11:16, 12.53it/s]WARNING 2025-03-02 10:04:43,661 lm-eval:374: API request failed with error message: Internal Server Error. Retrying...
      INFO 2025-03-02 10:04:53,247 instructlab.model.backends.vllm:494: Waiting for GPU VRAM reclamation...
      ERROR 2025-03-02 10:05:01,545 instructlab.cli.model.evaluate:272: An error occurred during evaluation: 500 Server Error: Internal Server Error for url: http://127.0.0.1:59873/v1/completions
      
      Requesting API:   5% 2560/56168 [04:45<1:39:38,  8.97it/s] 
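      To check whether the 500 comes from the vLLM server that ilab spawns rather than from the lm-eval client, the failing endpoint can be probed directly while the server is up. A minimal sketch using curl against vLLM's OpenAI-compatible completions API; the port (59873 in the log above) changes on every run, and the prompt and max_tokens values are placeholders:

      # Take the port from the serving output of the failing run
      PORT=59873
      curl -s -o /dev/null -w "%{http_code}\n" \
        "http://127.0.0.1:${PORT}/v1/completions" \
        -H "Content-Type: application/json" \
        -d '{"model": "/var/mnt/inststg1/instructlab/job/job_out/artifacts/phase2/checkpoints/hf_format/samples_372430", "prompt": "test", "max_tokens": 1}'

      A 500 here during a failing window would point at the server side; a 200 would suggest the failure is load- or batch-dependent.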


      Device Info (please complete the following information):

      • Hardware Specs:
        GPU: 8 x NVIDIA H100 80 GB
        Size: 160 vCPU | 1792 GiB | 200 Gbps
      • OS Version: RHEL AI 1.4.1
      • InstructLab Version: ilab, version 0.23.1
      • Output of the two requested commands:
        • sudo bootc status --format json | jq (booted image):
          registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4
        • ilab system info (InstructLab version, OS, and GPU / AI accelerator details): full output below

       

      bash-5.1# ilab system info 
      ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
      ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
      ggml_cuda_init: found 8 CUDA devices:
        Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 4: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 5: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 6: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
      Platform:
        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.50.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: dev-rhel-ai-training-client-h100-2
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 1763.83 GB
        memory.available: 1746.22 GB
        memory.used: 8.10 GB
       
      InstructLab:
        instructlab.version: 0.23.1
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.7.0
        instructlab-training.version: 0.7.0
       
      Torch:
        torch.version: 2.5.1
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: 12.4
        torch.version.hip: None
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: NVIDIA H100 80GB HBM3
        torch.cuda.0.free: 78.6 GB
        torch.cuda.0.total: 79.1 GB
        torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: NVIDIA H100 80GB HBM3
        torch.cuda.1.free: 78.6 GB
        torch.cuda.1.total: 79.1 GB
        torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: NVIDIA H100 80GB HBM3
        torch.cuda.2.free: 78.6 GB
        torch.cuda.2.total: 79.1 GB
        torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: NVIDIA H100 80GB HBM3
        torch.cuda.3.free: 78.6 GB
        torch.cuda.3.total: 79.1 GB
        torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.4.name: NVIDIA H100 80GB HBM3
        torch.cuda.4.free: 78.6 GB
        torch.cuda.4.total: 79.1 GB
        torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.5.name: NVIDIA H100 80GB HBM3
        torch.cuda.5.free: 78.6 GB
        torch.cuda.5.total: 79.1 GB
        torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.6.name: NVIDIA H100 80GB HBM3
        torch.cuda.6.free: 78.6 GB
        torch.cuda.6.total: 79.1 GB
        torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.7.name: NVIDIA H100 80GB HBM3
        torch.cuda.7.free: 78.6 GB
        torch.cuda.7.total: 79.1 GB
        torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
       
      llama_cpp_python:
        llama_cpp_python.version: 0.3.2
        llama_cpp_python.supports_gpu_offload: True
      

       

      Bug impact

      MMLU evaluation results cannot be produced for trained models.

      Known workaround

      Use `--few-shots 2` instead of the default `5`. The default few-shots value of `5` always caused MMLU to hard-error for us; after working with Oleg, we run `--few-shots 2`, which still fails, but only intermittently.
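      For reference, the full evaluation command with the workaround flag applied (paths are specific to our environment):

      ilab --config /var/mnt/inststg1/instructlab/config.yaml model evaluate \
        --benchmark mmlu \
        --model /var/mnt/inststg1/instructlab/job/job_out/artifacts/phase2/checkpoints/hf_format/samples_372430 \
        --gpus 8 --batch-size 64 --few-shots 2 --enable-serving-output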

       

        Assignee: Unassigned
        Reporter: Kodie Glosser (kodieglosseribm)
        Votes: 0
        Watchers: 2