  Red Hat Enterprise Linux AI
  RHELAI-3604

MMLU evaluation using the granite-3-1-8b-starter training base model fails intermittently on RHEL AI 1.4.1+


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Undefined
    • Component: Model Production

      To Reproduce

      Steps to reproduce the behavior:

      1. On a trained model with `granite-3-1-8b-starter` as the training base, run the evaluation command:
      ilab --config /var/mnt/inststg1/instructlab/config.yaml model evaluate --benchmark mmlu --model /var/mnt/inststg1/instructlab/job/job_out/artifacts/phase2/checkpoints/hf_format/samples_372430 --gpus 8 --batch-size 64 --few-shots 2 --enable-serving-output 

      In case it helps: the default few-shots value of `5` always caused MMLU to hard-error for us. After working with Oleg on some of these issues, we now run `--few-shots 2`, and the evaluation fails intermittently instead of every time.
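      Because the failure is intermittent, the easiest way to reproduce it is to re-run the evaluation several times and count failures. A minimal sketch, assuming the same config and checkpoint paths as above; the run count of 5 and the log paths are arbitrary:

      for i in $(seq 1 5); do
        ilab --config /var/mnt/inststg1/instructlab/config.yaml model evaluate \
          --benchmark mmlu \
          --model /var/mnt/inststg1/instructlab/job/job_out/artifacts/phase2/checkpoints/hf_format/samples_372430 \
          --gpus 8 --batch-size 64 --few-shots 2 --enable-serving-output \
          > "/tmp/mmlu_run_${i}.log" 2>&1 \
          && echo "run ${i}: ok" || echo "run ${i}: FAILED"
      done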

      Expected behavior

      • Expect MMLU evaluation to work as it did in previous versions (<1.4).

      Screenshots

       

      Error from request:

       

      WARNING 2025-03-02 10:04:38,337 lm-eval:374: API request failed with error message: Internal Server Error. Retrying...
      
      Requesting API:   5% 2560/56168 [04:23<1:11:16, 12.53it/s]WARNING 2025-03-02 10:04:43,661 lm-eval:374: API request failed with error message: Internal Server Error. Retrying...
      INFO 2025-03-02 10:04:53,247 instructlab.model.backends.vllm:494: Waiting for GPU VRAM reclamation...
      ERROR 2025-03-02 10:05:01,545 instructlab.cli.model.evaluate:272: An error occurred during evaluation: 500 Server Error: Internal Server Error for url: http://127.0.0.1:59873/v1/completions
      
      Requesting API:   5% 2560/56168 [04:45<1:39:38,  8.97it/s] 
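      To check whether the 500 comes from the vLLM server that ilab spawns rather than from the lm-eval client, the failing endpoint can be probed directly while the server is up. A minimal sketch using curl against vLLM's OpenAI-compatible completions API; the port (59873 in the log above) changes on every run, and the prompt and max_tokens values are placeholders:

      # Take the port from the serving output of the failing run
      PORT=59873
      curl -s -o /dev/null -w "%{http_code}\n" \
        "http://127.0.0.1:${PORT}/v1/completions" \
        -H "Content-Type: application/json" \
        -d '{"model": "/var/mnt/inststg1/instructlab/job/job_out/artifacts/phase2/checkpoints/hf_format/samples_372430", "prompt": "test", "max_tokens": 1}'

      A 500 here during a failing window would point at the server side; a 200 would suggest the failure is load- or batch-dependent.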


      Device Info (please complete the following information):

      • Hardware Specs:
        GPU: 8 x NVIDIA H100 80 GB
        Size: 160 vCPU | 1792 GiB | 200 Gbps
      • OS Version: RHEL AI 1.4.1
      • InstructLab Version: ilab, version 0.23.1
      • Output of the two requested commands:
        • sudo bootc status --format json | jq (booted image):
          registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.4
        • ilab system info (InstructLab version, OS, and GPU / AI accelerator details): full output below

       

      bash-5.1# ilab system info 
      ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
      ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
      ggml_cuda_init: found 8 CUDA devices:
        Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 4: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 5: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 6: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
        Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
      Platform:
        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.50.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: dev-rhel-ai-training-client-h100-2
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 1763.83 GB
        memory.available: 1746.22 GB
        memory.used: 8.10 GB
       
      InstructLab:
        instructlab.version: 0.23.1
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.7.0
        instructlab-training.version: 0.7.0
       
      Torch:
        torch.version: 2.5.1
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: 12.4
        torch.version.hip: None
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: NVIDIA H100 80GB HBM3
        torch.cuda.0.free: 78.6 GB
        torch.cuda.0.total: 79.1 GB
        torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: NVIDIA H100 80GB HBM3
        torch.cuda.1.free: 78.6 GB
        torch.cuda.1.total: 79.1 GB
        torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: NVIDIA H100 80GB HBM3
        torch.cuda.2.free: 78.6 GB
        torch.cuda.2.total: 79.1 GB
        torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: NVIDIA H100 80GB HBM3
        torch.cuda.3.free: 78.6 GB
        torch.cuda.3.total: 79.1 GB
        torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.4.name: NVIDIA H100 80GB HBM3
        torch.cuda.4.free: 78.6 GB
        torch.cuda.4.total: 79.1 GB
        torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.5.name: NVIDIA H100 80GB HBM3
        torch.cuda.5.free: 78.6 GB
        torch.cuda.5.total: 79.1 GB
        torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.6.name: NVIDIA H100 80GB HBM3
        torch.cuda.6.free: 78.6 GB
        torch.cuda.6.total: 79.1 GB
        torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.7.name: NVIDIA H100 80GB HBM3
        torch.cuda.7.free: 78.6 GB
        torch.cuda.7.total: 79.1 GB
        torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
       
      llama_cpp_python:
        llama_cpp_python.version: 0.3.2
        llama_cpp_python.supports_gpu_offload: True
      

       

      Bug impact

      MMLU evaluation results cannot be produced for trained models.

      Known workaround

      Use `--few-shots 2` instead of the default `5`. The default few-shots value of `5` always caused MMLU to hard-error for us; after working with Oleg, we run `--few-shots 2`, which still fails, but only intermittently.
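      For reference, the full evaluation command with the workaround flag applied (paths are specific to our environment):

      ilab --config /var/mnt/inststg1/instructlab/config.yaml model evaluate \
        --benchmark mmlu \
        --model /var/mnt/inststg1/instructlab/job/job_out/artifacts/phase2/checkpoints/hf_format/samples_372430 \
        --gpus 8 --batch-size 64 --few-shots 2 --enable-serving-output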

       

        Assignee: Unassigned
        Reporter: Kodie Glosser (kodieglosseribm)
        Votes: 0
        Watchers: 2