Red Hat Enterprise Linux AI / RHELAI-4514

[Certification] Evaluation using MT-Bench failed on 8x MI300

Sprint: Sprint 1, Sprint 2

To Reproduce

Steps to reproduce the behavior:

1. After multi-phase training completes, run evaluation using MT-Bench:
   ilab model evaluate --benchmark mt_bench --judge-model /var/home/rhcert/.cache/instructlab/models/prometheus-8x7b-v2-0 --model /var/home/rhcert/.local/share/instructlab/checkpoints/hf_format/samples_2397 --gpus 8 --enable-serving-output
2. The evaluation fails with:
   An error occurred during evaluation: Failed to start server: vLLM failed to start up in 397.2 seconds
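The failure above is a startup timeout: the evaluation harness waits for the vLLM server to become reachable and gives up after a deadline. A minimal sketch of that wait-for-ready pattern, with hypothetical names (this is not InstructLab's actual implementation):

```python
import time

def wait_for_server(is_ready, timeout_s=420.0, interval_s=0.5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll is_ready() until it returns True or timeout_s elapses.

    Returns the elapsed seconds on success; raises TimeoutError otherwise.
    Illustrative only -- not the instructlab/vLLM code path.
    """
    start = clock()
    while True:
        if is_ready():
            return clock() - start
        elapsed = clock() - start
        if elapsed >= timeout_s:
            # Mirrors the shape of the error seen in this report
            raise TimeoutError(
                f"vLLM failed to start up in {elapsed:.1f} seconds"
            )
        sleep(interval_s)
```

In the real failure, the readiness check would be an HTTP probe against the vLLM endpoint; a timeout like the 397.2 s seen here typically means the server process never finished loading the model onto the GPUs within the deadline.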

      Expected behavior

• Tests should pass

      Screenshots

• Logs attached

      Device Info (please complete the following information):

      • Hardware Specs: Dell PowerEdge XE9680 
      • OS Version: RHEL AI 1.5 AMD
      • InstructLab Version: 0.26.1
• Output of the two diagnostic commands:
        • sudo bootc status --format json | jq .status.booted.image.image.image to print the name and tag
          • registry.redhat.io/rhelai1/bootc-amd-rhel9:1.5
          • Version: 9.20250429.0 (2025-05-15T19:44:17Z) 
        • ilab system info to print detailed information about InstructLab version, OS, and hardware – including GPU / AI accelerator hardware
          • Platform:
            sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
            sys.platform: linux
            os.name: posix
            platform.release: 5.14.0-427.65.1.el9_4.x86_64
            platform.machine: x86_64
            platform.node: j42-h01-000-xe9680.rdu3.labs.perfscale.redhat.com
            platform.python_version: 3.11.7
            os-release.ID: rhel
            os-release.VERSION_ID: 9.4
            os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
            memory.total: 2015.38 GB
            memory.available: 1978.21 GB
            memory.used: 28.74 GB

      InstructLab:
      instructlab.version: 0.26.1
      instructlab-dolomite.version: 0.2.0
      instructlab-eval.version: 0.5.1
      instructlab-quantize.version: 0.1.0
      instructlab-schema.version: 0.4.2
      instructlab-sdg.version: 0.8.2
      instructlab-training.version: 0.10.2

      Torch:
      torch.version: 2.6.0
      torch.backends.cpu.capability: AVX512
      torch.version.cuda: None
      torch.version.hip: 6.3.42134-a9a80e791
      torch.cuda.available: True
      torch.backends.cuda.is_built: True
      torch.backends.mps.is_built: False
      torch.backends.mps.is_available: False
      torch.cuda.bf16: True
      torch.cuda.current.device: 0
      torch.cuda.0.name: AMD Radeon Graphics
      torch.cuda.0.free: 191.5 GB
      torch.cuda.0.total: 192.0 GB
      torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.1.name: AMD Radeon Graphics
      torch.cuda.1.free: 191.5 GB
      torch.cuda.1.total: 192.0 GB
      torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.2.name: AMD Radeon Graphics
      torch.cuda.2.free: 191.5 GB
      torch.cuda.2.total: 192.0 GB
      torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.3.name: AMD Radeon Graphics
      torch.cuda.3.free: 191.5 GB
      torch.cuda.3.total: 192.0 GB
      torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.4.name: AMD Radeon Graphics
      torch.cuda.4.free: 191.5 GB
      torch.cuda.4.total: 192.0 GB
      torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.5.name: AMD Radeon Graphics
      torch.cuda.5.free: 191.5 GB
      torch.cuda.5.total: 192.0 GB
      torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.6.name: AMD Radeon Graphics
      torch.cuda.6.free: 191.5 GB
      torch.cuda.6.total: 192.0 GB
      torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      torch.cuda.7.name: AMD Radeon Graphics
      torch.cuda.7.free: 191.5 GB
      torch.cuda.7.total: 192.0 GB
      torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)

      llama_cpp_python:
      llama_cpp_python.version: 0.3.6
      llama_cpp_python.supports_gpu_offload: False

      Bug impact

• Certification of the system is pending on this issue

      Known workaround

• None known

      Additional context

• None

Assignee: Unassigned
Reporter: Aman Turate (rh-ee-aturate)