RHELAI-4953: RHEL AI 1.5.4-1 doesn't detect AMD accelerators


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Component: Accelerators - AMD

      image used: registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.4-1760104049

      Machine: Standard_ND96is_MI300X_v5

      Not even amd-smi detects the accelerators, which are visible via lspci.

      [azureuser@fzatlouk-rhelai-1 ~]$ amd-smi 
      ERROR:root:Unable to detect any GPU devices, check amdgpu version and module status (sudo modprobe amdgpu)
      [azureuser@fzatlouk-rhelai-1 ~]$ lspci 
      0002:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0003:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0004:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0005:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0006:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0007:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0008:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0009:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      1876:00:02.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] (rev 80)
      2530:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      35d1:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      4fd0:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      7377:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      8395:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      86c4:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      dacc:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      e5ae:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
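
      The amd-smi error above suggests checking the amdgpu module state. The commands below are a minimal triage sketch, assuming standard RHEL tooling is present in the bootc image; the only value not taken from the output above is the AMD PCI vendor ID 1002.

      # Is an amdgpu module available for this kernel, and is it loaded?
      lsmod | grep amdgpu
      modinfo amdgpu | head -n 5
      sudo modprobe amdgpu; sudo dmesg | grep -i amdgpu | tail -n 20

      # Which kernel driver, if any, is bound to the MI300X VFs listed by lspci?
      lspci -nnk -d 1002: | grep -E 'Processing|Kernel driver in use|Kernel modules'

      # ROCm user space needs these device nodes to enumerate GPUs.
      ls -l /dev/kfd /dev/dri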

      [azureuser@fzatlouk-rhelai-1 ~]$ VLLM_LOGGING_LEVEL=DEBUG ilab model serve
      INFO 2025-10-13 14:22:19,319 instructlab.model.serve_backend:80: Setting backend_type in the serve config to vllm
      INFO 2025-10-13 14:22:19,332 instructlab.model.serve_backend:86: Using model '/var/home/azureuser/.cache/instructlab/models/granite-3.1-8b-lab-v2.2' with -1 gpu-layers and 4096 max context size.
      INFO 2025-10-13 14:22:19,362 instructlab.model.serve_backend:133: '--gpus' flag used alongside '--tensor-parallel-size' in the vllm_args section of the config file. Using value of the --gpus flag.
      INFO 2025-10-13 14:22:19,554 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:8000/v1
      DEBUG 10-13 14:22:30 [__init__.py:28] No plugins for group vllm.platform_plugins found.
      DEBUG 10-13 14:22:30 [__init__.py:34] Checking if TPU platform is available.
      DEBUG 10-13 14:22:30 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
      DEBUG 10-13 14:22:30 [__init__.py:52] Checking if CUDA platform is available.
      DEBUG 10-13 14:22:30 [__init__.py:76] Exception happens when checking CUDA platform: NVML Shared Library Not Found
      DEBUG 10-13 14:22:30 [__init__.py:93] CUDA platform is not available because: NVML Shared Library Not Found
      DEBUG 10-13 14:22:30 [__init__.py:100] Checking if ROCm platform is available.
      DEBUG 10-13 14:22:30 [__init__.py:109] ROCm platform is not available because no GPU is found.
      DEBUG 10-13 14:22:30 [__init__.py:122] Checking if HPU platform is available.
      DEBUG 10-13 14:22:30 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
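
      For completeness, the same conclusion can be cross-checked outside vLLM's platform probing; a minimal sketch, assuming the ROCm build of PyTorch used by ilab is importable from this environment and that the rpm database is available in the bootc image:

      # torch.version.hip is None on non-ROCm builds; device_count() should report 8 GPUs on this VM size.
      python3 -c 'import torch; print(torch.version.hip, torch.cuda.is_available(), torch.cuda.device_count())'

      # Confirm which amdgpu/ROCm driver packages the image actually carries.
      rpm -qa | grep -iE 'amdgpu|rocm'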

       

       

              Assignee: Percy Mattsson (rh-ee-pmattsso)
              Reporter: František Zatloukal (fzatlouk@redhat.com)
              Votes: 0
              Watchers: 5
