-
Bug
-
Resolution: Done
-
Critical
Approved
Image used: registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.4-1760104049
Machine: Standard_ND96is_MI300X_v5
Not even amd-smi detects the accelerators, which are visible via lspci.
[azureuser@fzatlouk-rhelai-1 ~]$ amd-smi
ERROR:root:Unable to detect any GPU devices, check amdgpu version and module status (sudo modprobe amdgpu)
[azureuser@fzatlouk-rhelai-1 ~]$ lspci
0002:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
0003:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
0004:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
0005:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
0006:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
0007:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
0008:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
0009:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
1876:00:02.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] (rev 80)
2530:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
35d1:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
4fd0:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
7377:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
8395:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
86c4:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
dacc:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
e5ae:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
[azureuser@fzatlouk-rhelai-1 ~]$ VLLM_LOGGING_LEVEL=DEBUG ilab model serve
INFO 2025-10-13 14:22:19,319 instructlab.model.serve_backend:80: Setting backend_type in the serve config to vllm
INFO 2025-10-13 14:22:19,332 instructlab.model.serve_backend:86: Using model '/var/home/azureuser/.cache/instructlab/models/granite-3.1-8b-lab-v2.2' with -1 gpu-layers and 4096 max context size.
INFO 2025-10-13 14:22:19,362 instructlab.model.serve_backend:133: '-gpus' flag used alongside '-tensor-parallel-size' in the vllm_args section of the config file. Using value of the --gpus flag.
INFO 2025-10-13 14:22:19,554 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:8000/v1
DEBUG 10-13 14:22:30 [__init__.py:28] No plugins for group vllm.platform_plugins found.
DEBUG 10-13 14:22:30 [__init__.py:34] Checking if TPU platform is available.
DEBUG 10-13 14:22:30 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
DEBUG 10-13 14:22:30 [__init__.py:52] Checking if CUDA platform is available.
DEBUG 10-13 14:22:30 [__init__.py:76] Exception happens when checking CUDA platform: NVML Shared Library Not Found
DEBUG 10-13 14:22:30 [__init__.py:93] CUDA platform is not available because: NVML Shared Library Not Found
DEBUG 10-13 14:22:30 [__init__.py:100] Checking if ROCm platform is available.
DEBUG 10-13 14:22:30 [__init__.py:109] ROCm platform is not available because no GPU is found.
DEBUG 10-13 14:22:30 [__init__.py:122] Checking if HPU platform is available.
DEBUG 10-13 14:22:30 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
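
To narrow this down on an affected host, the two observations above (accelerators listed by lspci, none found by amd-smi or by the vLLM ROCm check) can be cross-checked against whether the amdgpu kernel module is loaded at all. The following is only a rough sketch, not a confirmed root-cause test: the function names are hypothetical, it assumes the accelerators appear under the "Processing accelerators" class as in the lspci output in this ticket, and it assumes /proc/modules is readable.

```python
import subprocess
from pathlib import Path


def count_accelerators(lspci_output: str) -> int:
    """Count PCI functions that lspci labels 'Processing accelerators'."""
    return sum(
        1 for line in lspci_output.splitlines()
        if "Processing accelerators" in line
    )


def module_loaded(proc_modules_text: str, name: str = "amdgpu") -> bool:
    """Return True if `name` is the first field of any /proc/modules line."""
    return any(
        line.split()[0] == name
        for line in proc_modules_text.splitlines()
        if line.strip()
    )


def diagnose(lspci_output: str, proc_modules_text: str) -> str:
    """Summarize PCI visibility against kernel module state."""
    n = count_accelerators(lspci_output)
    if n == 0:
        return "no accelerators visible on the PCI bus"
    if not module_loaded(proc_modules_text):
        return f"{n} accelerator(s) on the PCI bus, but amdgpu is not loaded"
    return f"{n} accelerator(s) on the PCI bus and amdgpu is loaded"


def diagnose_host() -> str:
    """Run the check against the local machine (needs lspci installed)."""
    lspci_output = subprocess.run(
        ["lspci"], capture_output=True, text=True, check=True
    ).stdout
    return diagnose(lspci_output, Path("/proc/modules").read_text())
```

Calling `print(diagnose_host())` on the VM would report which of the three cases applies. If the module turns out to be loaded while amd-smi still finds no devices, the kernel log is the next place to look; this sketch does not inspect it.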