- Bug
- Resolution: Unresolved
- Major
Summary: ilab_inferencing test fails with a timeout error during RHEL AI certification
To Reproduce
Steps to reproduce the behavior:
- Our hardware vendor has started running the RHEL AI certification, following the steps described in https://docs.redhat.com/en/documentation/red_hat_hardware_certification/2025/html/red_hat_hardware_certification_test_suite_user_guide/assembly_setting-up-host-systems-for-running-tests_assembly_open_new_cert_portal#for-rhel-ai-hardware-certification_hw-test-suite-setting-test-environment
Expected behavior
- The ilab_inferencing test runs to completion successfully
Device Info (please complete the following information):
- Hardware Specs: [4 NVIDIA L40S GPUs, Micro-Star International Co., Ltd (MSI) MSI G4101-01]
- OS Version: [kernel-5.14.0-427.65.1.el9_4, RHEL AI 1.5]
- InstructLab Version: [
instructlab.version: 0.26.1
instructlab-dolomite.version: 0.2.0
instructlab-eval.version: 0.5.1
instructlab-quantize.version: 0.1.0
instructlab-schema.version: 0.4.2
instructlab-sdg.version: 0.8.2
instructlab-training.version: 0.10.2
]
- Provide the output of these two commands:
- bootc image: [registry.redhat.io/rhelai1/bootc-nvidia-rhel9:1.5 (index: 0)]
- ilab system info [^ilab system.txt]
Bug impact
- This test fails every time, which blocks the RHEL AI certification
Known workaround
- none
Additional context
- Output of "nvidia-smi topo -m":
        GPU0  GPU1  GPU2  GPU3  CPU Affinity  NUMA Affinity  GPU NUMA ID
  GPU0  X     SYS   SYS   SYS   0-319         0              N/A
  GPU1  SYS   X     SYS   SYS   0-319         0              N/A
  GPU2  SYS   SYS   X     SYS   0-319         0              N/A
  GPU3  SYS   SYS   SYS   X     0-319         0              N/A
- We asked the partner to set the parameter below; the test still failed
NCCL_P2P_DISABLE=1
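
To make the two items above easier to check on the partner's system, here is a minimal diagnostic sketch in Python. It is not part of the certification test suite, and the helper names (parse_topo, parse_environ, read_process_env) are hypothetical. The first helper reads the GPU-to-GPU matrix from "nvidia-smi topo -m" text. In the matrix reported here every pair is SYS, which the nvidia-smi legend describes as a connection traversing PCIe as well as the SMP interconnect between NUMA nodes. The second helper decodes /proc/<pid>/environ, which could be used to confirm that NCCL_P2P_DISABLE=1 actually reached the inference process. This assumes the process ID is known and that the caller has permission to read that file. The file only reflects the environment the process was started with.

```python
# Sample matrix copied from this report (ANSI escape codes already removed).
TOPO_TEXT = """\
GPU0 GPU1 GPU2 GPU3 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X SYS SYS SYS 0-319 0 N/A
GPU1 SYS X SYS SYS 0-319 0 N/A
GPU2 SYS SYS X SYS 0-319 0 N/A
GPU3 SYS SYS SYS X 0-319 0 N/A
"""


def parse_topo(text):
    """Return {(gpu_a, gpu_b): link_type} for every ordered GPU pair."""
    rows = [line.split() for line in text.strip().splitlines()]
    # Header tokens of the form GPU<number> define the matrix columns;
    # the bare "GPU" in "GPU NUMA ID" is skipped because it has no digits.
    gpus = [t for t in rows[0] if t.startswith("GPU") and t[3:].isdigit()]
    links = {}
    for row in rows[1:]:
        if not row or row[0] not in gpus:
            continue  # ignore legend lines or non-GPU rows such as NICs
        name, cells = row[0], row[1:1 + len(gpus)]
        for other, cell in zip(gpus, cells):
            if other != name:  # skip the diagonal "X" entries
                links[(name, other)] = cell
    return links


def parse_environ(raw):
    """Decode the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for entry in raw.split(b"\0"):
        if b"=" in entry:
            key, _, value = entry.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def read_process_env(pid):
    """Read the start-up environment of a running process (Linux only)."""
    with open(f"/proc/{pid}/environ", "rb") as fh:
        return parse_environ(fh.read())


if __name__ == "__main__":
    links = parse_topo(TOPO_TEXT)
    print("link types between GPUs:", sorted(set(links.values())))
    # Hypothetical usage: replace 12345 with the PID of the serving process.
    # print(read_process_env(12345).get("NCCL_P2P_DISABLE"))
```

If read_process_env returns no NCCL_P2P_DISABLE entry for the serving process, the variable was probably set in a shell that the test does not inherit from, which would explain why adding it made no difference.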