Red Hat Enterprise Linux AI / RHELAI-4654

[certification] Evaluation with MMLU benchmark fails


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major

      To Reproduce

      Steps to reproduce the behavior:

      1. Install the RHEL AI 1.5 NVIDIA image on bare metal
      2. Run single-phase training
      3. Run the MMLU evaluation on any of the trained models:
         ilab model evaluate --benchmark mmlu --model /var/home/rhcert/.local/share/instructlab/checkpoints/hf_format/samples_2507
      4. See the error

      Expected behavior

      • Test should pass

      Logs

      Device Info (please complete the following information):

      • Hardware Specs: INTEL(R) XEON(R) GOLD 6548Y+ 4xL40s
      • OS Version: RHEL AI 1.5 NVIDIA
      • InstructLab Version: 0.26.1
      • Output of the two requested commands:
        • registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.5 9.20250429.0
        •
          ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
          ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
          ggml_cuda_init: found 4 CUDA devices:
            Device 0: NVIDIA L40S, compute capability 8.9, VMM: yes
            Device 1: NVIDIA L40S, compute capability 8.9, VMM: yes
            Device 2: NVIDIA L40S, compute capability 8.9, VMM: yes
            Device 3: NVIDIA L40S, compute capability 8.9, VMM: yes
          Platform:
            sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
            sys.platform: linux
            os.name: posix
            platform.release: 5.14.0-427.65.1.el9_4.x86_64
            platform.machine: x86_64
            platform.node: nvd-srv-28.nvidia.eng.rdu2.dc.redhat.com
            platform.python_version: 3.11.7
            os-release.ID: rhel
            os-release.VERSION_ID: 9.4
            os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
            memory.total: 251.39 GB
            memory.available: 239.33 GB
            memory.used: 10.26 GB
          InstructLab:
            instructlab.version: 0.26.1
            instructlab-dolomite.version: 0.2.0
            instructlab-eval.version: 0.5.1
            instructlab-quantize.version: 0.1.0
            instructlab-schema.version: 0.4.2
            instructlab-sdg.version: 0.8.2
            instructlab-training.version: 0.10.2
          Torch:
            torch.version: 2.6.0
            torch.backends.cpu.capability: AVX512
            torch.version.cuda: 12.4
            torch.version.hip: None
            torch.cuda.available: True
            torch.backends.cuda.is_built: True
            torch.backends.mps.is_built: False
            torch.backends.mps.is_available: False
            torch.cuda.bf16: True
            torch.cuda.current.device: 0
            torch.cuda.0.name: NVIDIA L40S
            torch.cuda.0.free: 43.9 GB
            torch.cuda.0.total: 44.3 GB
            torch.cuda.0.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.1.name: NVIDIA L40S
            torch.cuda.1.free: 43.9 GB
            torch.cuda.1.total: 44.3 GB
            torch.cuda.1.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.2.name: NVIDIA L40S
            torch.cuda.2.free: 43.9 GB
            torch.cuda.2.total: 44.3 GB
            torch.cuda.2.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.3.name: NVIDIA L40S
            torch.cuda.3.free: 43.9 GB
            torch.cuda.3.total: 44.3 GB
            torch.cuda.3.capability: 8.9 (see https://developer.nvidia.com/cuda-gpus#compute)
          llama_cpp_python:
            llama_cpp_python.version: 0.3.6
            llama_cpp_python.supports_gpu_offload: True
      Bug impact

      • Preventing the certification test from passing

      Known workaround

      • None

      Additional context

      • Error:
        ERROR 2025-07-15 06:01:02,579 instructlab.cli.model.evaluate:313: An error occurred during evaluation: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/hails/mmlu_no_train/paths-info/b2e1ec9aa795adafe68e8e983248dbd4b52a1c60
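        The 429 indicates Hugging Face Hub rate limiting while the evaluation harness fetches metadata for the hails/mmlu_no_train dataset. A generic client-side mitigation (a hypothetical sketch, not part of ilab or instructlab-eval) is to retry the Hub call with exponential backoff:

        ```python
        import time

        def with_backoff(fetch, retries=5, base_delay=1.0, sleep=time.sleep):
            """Call fetch(), retrying on 429-style errors with exponential backoff."""
            for attempt in range(retries):
                try:
                    return fetch()
                except RuntimeError as err:  # stand-in for requests' HTTPError
                    if "429" not in str(err) or attempt == retries - 1:
                        raise
                    sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...

        # Demo: a fake Hub call that is rate-limited twice, then succeeds.
        calls = {"n": 0}
        def fake_paths_info():
            calls["n"] += 1
            if calls["n"] < 3:
                raise RuntimeError("429 Client Error: Too Many Requests")
            return "paths-info"
        ```

        Another avenue, assuming the host has network access before the run, would be to pre-populate the local Hugging Face cache (e.g. with huggingface_hub's snapshot_download for the dataset repo) and rerun the evaluation with HF_HUB_OFFLINE=1 so no Hub API request is made at all.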

              Assignee: Unassigned
              Reporter: Aman Turate (rh-ee-aturate)
              Courtney Pacheco, Oleg Silkin