Red Hat Enterprise Linux AI / RHELAI-4928

[certification] HPE DL384 Gen12 2xNVIDIA GH200 144G HBM3e


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Version: rhelai-1.5
    • Component: InstructLab - SDG

      To Reproduce

      Steps to reproduce the behavior:

      1. Configure the system as per the attached config (ilab_config.txt)
      2. Run `ilab config init`
      3. Start the SDG process: `ilab data generate --pipeline full --gpus 2`
      4. Observe the error:
        INFO 2025-08-01 15:01:07,676 instructlab.model.backends.vllm:148: Gave up waiting for vLLM server to start at http://127.0.0.1:48703/v1 after 1200 attempts
        INFO 2025-08-01 15:01:12,795 instructlab.model.backends.vllm:512: Waiting for GPU VRAM reclamation...
        failed to generate data with exception: Failed to start server: vLLM failed to start up in 3479.3 seconds 
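
      As a back-of-the-envelope check (assuming the 1200 attempts span the full 3479.3 s reported above), the startup poll interval works out to roughly 2.9 s per attempt:

```python
# Rough check of the vLLM startup polling cadence implied by the log
# (assumption: the 1200 attempts cover the full 3479.3 s wait).
total_wait_s = 3479.3   # from "vLLM failed to start up in 3479.3 seconds"
attempts = 1200         # from "after 1200 attempts"

interval_s = total_wait_s / attempts
print(f"~{interval_s:.1f} s per attempt")  # ~2.9 s per attempt
```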

      Expected behavior

      • The SDG run should complete without error

      Device Info (please complete the following information):

      • Hardware Specs: HPE DL384 Gen12 2xNVIDIA GH200 144G HBM3e
      • OS Version: RHEL AI 1.5
      • InstructLab Version: [instructlab.version: 0.26.1]
      • bootc image: registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.5
      • Output of `ilab system info`:
          ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
          ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
          ggml_cuda_init: found 2 CUDA devices:
            Device 0: NVIDIA GH200 144G HBM3e, compute capability 9.0, VMM: yes
            Device 1: NVIDIA GH200 144G HBM3e, compute capability 9.0, VMM: yes
          Platform:
            sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
            sys.platform: linux
            os.name: posix
            platform.release: 5.14.0-427.65.1.el9_4.aarch64
            platform.machine: aarch64
            platform.node: dl384rhelai
            platform.python_version: 3.11.7
            os-release.ID: rhel
            os-release.VERSION_ID: 9.4
            os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
            memory.total: 1227.71 GB
            memory.available: 1213.35 GB
            memory.used: 8.49 GB
          
          InstructLab:
            instructlab.version: 0.26.1
            instructlab-dolomite.version: 0.2.0
            instructlab-eval.version: 0.5.1
            instructlab-quantize.version: 0.1.0
            instructlab-schema.version: 0.4.2
            instructlab-sdg.version: 0.8.2
            instructlab-training.version: 0.10.2
          
          Torch:
            torch.version: 2.6.0
            torch.backends.cpu.capability: DEFAULT
            torch.version.cuda: 12.4
            torch.version.hip: None
            torch.cuda.available: True
            torch.backends.cuda.is_built: True
            torch.backends.mps.is_built: False
            torch.backends.mps.is_available: False
            torch.cuda.bf16: True
            torch.cuda.current.device: 0
            torch.cuda.0.name: NVIDIA GH200 144G HBM3e
            torch.cuda.0.free: 142.1 GB
            torch.cuda.0.total: 142.6 GB
            torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.1.name: NVIDIA GH200 144G HBM3e
            torch.cuda.1.free: 142.1 GB
            torch.cuda.1.total: 142.6 GB
            torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
          
          llama_cpp_python:
            llama_cpp_python.version: 0.3.6
            llama_cpp_python.supports_gpu_offload: True 

      Bug impact

      • Validation efforts are blocked

      Known workaround

      • None
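
      For triage, the base URL that vLLM was expected to serve can be pulled out of the failure line and probed manually (e.g. with curl against `<base_url>/models`). A minimal sketch, assuming the log line format shown above; note the port (48703 here) is chosen at runtime and will differ between runs:

```python
import re

# Failure line copied verbatim from the report above.
line = ("INFO 2025-08-01 15:01:07,676 instructlab.model.backends.vllm:148: "
        "Gave up waiting for vLLM server to start at "
        "http://127.0.0.1:48703/v1 after 1200 attempts")

# Extract the base URL and the attempt count from the message.
m = re.search(r"to start at (https?://\S+) after (\d+) attempts", line)
base_url, attempts = m.group(1), int(m.group(2))
print(base_url, attempts)
```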

      Attachments

        1. ilab_config.txt (8 kB)
        2. ilab_serving (12).log (25 kB)
        3. SDG (8).log (424 kB)

      Assignee: Unassigned
      Reporter: Aman Turate (rh-ee-aturate)