Red Hat Enterprise Linux AI / RHELAI-4929

ilab model serve test failed with timeout error during RHEL AI certification

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Accelerators - NVIDIA

      Summary: ilab_inferencing test failed with a timeout error during RHEL AI certification

      To Reproduce

      Steps to reproduce the behavior:

      1. Our hardware vendor is running the RHEL AI certification, following the steps described in https://docs.redhat.com/en/documentation/red_hat_hardware_certification/2025/html/red_hat_hardware_certification_test_suite_user_guide/assembly_setting-up-host-systems-for-running-tests_assembly_open_new_cert_portal#for-rhel-ai-hardware-certification_hw-test-suite-setting-test-environment

      Expected behavior

      • The ilab_inferencing test runs successfully
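
When triaging a serving timeout like this one, it can help to separate "the server is slow to become ready" from "the server never comes up". The sketch below polls a readiness probe until a deadline. It is an illustration only, not part of the certification test suite: the function names `http_probe` and `wait_until_ready` are hypothetical, and the endpoint in the note after the block (an OpenAI-compatible `/v1/models` route on `127.0.0.1:8000`) is an assumption that should be checked against the host and port in the ilab config.

```python
import time
import urllib.error
import urllib.request


def http_probe(url, timeout_s=5.0):
    """Return True if a GET on `url` answers with HTTP 200, else False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or HTTP error: server not ready yet.
        return False


def wait_until_ready(probe, deadline_s, interval_s=5.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll `probe()` until it returns True or `deadline_s` seconds elapse.

    Returns the elapsed seconds on success, or None on timeout.
    `clock` and `sleep` are injectable so the loop can be tested offline.
    """
    start = clock()
    while True:
        if probe():
            return clock() - start
        if clock() - start >= deadline_s:
            return None
        sleep(interval_s)
```

For example, `wait_until_ready(lambda: http_probe("http://127.0.0.1:8000/v1/models"), deadline_s=600)` would report how long the server took to answer, or `None` if it never did within ten minutes (URL assumed, see above).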

      Device Info (please complete the following information):

      • Hardware Specs: [4 NVIDIA L40S GPUs, Micro-Star International Co., Ltd (MSI) MSI G4101-01]
      • OS Version: [kernel-5.14.0-427.65.1.el9_4, RHEL AI 1.5]
      • InstructLab Version: [
        instructlab.version: 0.26.1
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.8.2
        instructlab-training.version: 0.10.2
        ]

      • Provide the output of these two commands:
        • bootc image: [registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.5 (index: 0)]
        • ilab system info [^ilab system.txt]

      Bug impact

      • This test always fails, which blocks the RHEL AI certification

      Known workaround

      • none

      Additional context

      • Output of "nvidia-smi topo -m":

              GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
        GPU0  X     SYS   SYS   SYS   0-319         0              N/A
        GPU1  SYS   X     SYS   SYS   0-319         0              N/A
        GPU2  SYS   SYS   X     SYS   0-319         0              N/A
        GPU3  SYS   SYS   SYS   X     0-319         0              N/A
      • We asked the partner to set the following parameter, but the test still failed:
        NCCL_P2P_DISABLE=1
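
The topology matrix above reports SYS for every GPU pair, which in the nvidia-smi legend means the connection traverses PCIe and the inter-socket interconnect. As a rough aid for comparing topologies across systems, the sketch below parses the GPU-to-GPU part of `nvidia-smi topo -m` output and checks whether all pairs are SYS. The helper names `parse_topo_matrix` and `all_pairs_sys` are hypothetical, and the parser assumes the whitespace-separated layout shown in this ticket. It also strips terminal color codes such as the `[4m` and `[0m` remnants seen in the pasted output.

```python
import re

# Matches ANSI color codes, with or without the leading ESC byte
# (pasted output often loses the ESC and keeps only "[4m" / "[0m").
_ANSI = re.compile(r"\x1b?\[[0-9;]*m")


def parse_topo_matrix(text):
    """Parse the GPU-to-GPU block of `nvidia-smi topo -m` output.

    Returns {(row_gpu, col_gpu): link_type} for every pair with row != col.
    """
    clean = _ANSI.sub("", text)
    rows = [ln.split() for ln in clean.strip().splitlines() if ln.strip()]
    # Column headers look like GPU0, GPU1, ...; the bare "GPU" in
    # "GPU NUMA ID" is excluded because it has no numeric suffix.
    gpus = [tok for tok in rows[0] if tok.startswith("GPU") and tok[3:].isdigit()]
    links = {}
    for row in rows[1:]:
        if row[0] not in gpus:
            continue  # skip legend or NIC rows
        for col_gpu, link in zip(gpus, row[1:1 + len(gpus)]):
            if col_gpu != row[0]:
                links[(row[0], col_gpu)] = link
    return links


def all_pairs_sys(links):
    """True if every GPU pair is connected only via SYS."""
    return bool(links) and all(link == "SYS" for link in links.values())
```

A result of `True` from `all_pairs_sys` only describes the interconnect; it does not by itself explain the timeout.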

      Attachments

      1. aws_g6e.x12.tar (110 kB, Vikash Shaw)
      2. ilab_config_1gpu.txt (7 kB, Aman Turate)
      3. ilab_config.txt (6 kB, Aman Turate)
      4. ilab_init (1)-1.log (1 kB, Jinhua Li)
      5. ilab_serving_1gpu.log (35 kB, Aman Turate)
      6. ilab_serving (3).log (26 kB, Akshay Saswadkar)
      7. ilab_serving (3)-1.log (26 kB, Jinhua Li)
      8. ilab system-1.txt (3 kB, Jinhua Li)
      9. rhelai_log.txt (85 kB, Jinhua Li)

              rh-ee-vshaw Vikash Shaw
              jinhli@redhat.com Jinhua Li