RHELAI-4953: RHEL AI 1.5.4-1 doesn't detect AMD accelerators


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Component: Accelerators - AMD

      image used: registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.4-1760104049

      Machine: Standard_ND96is_MI300X_v5

      Not even amd-smi detects the accelerators, which are visible via lspci.

      [azureuser@fzatlouk-rhelai-1 ~]$ amd-smi 
      ERROR:root:Unable to detect any GPU devices, check amdgpu version and module status (sudo modprobe amdgpu)
      [azureuser@fzatlouk-rhelai-1 ~]$ lspci 
      0002:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0003:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0004:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0005:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0006:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0007:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0008:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      0009:00:00.0 Processing accelerators: Advanced Micro Devices, Inc. [AMD/ATI] Aqua Vanjaram [Instinct MI300X VF]
      1876:00:02.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] (rev 80)
      2530:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      35d1:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      4fd0:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      7377:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      8395:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      86c4:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      dacc:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
      e5ae:00:00.0 Non-Volatile memory controller: Microsoft Corporation Device b111
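
      The amd-smi error above suggests checking the amdgpu module state. The commands below are a minimal triage sketch, assuming standard RHEL tooling is present in the bootc image; the only value not taken from the output above is the AMD PCI vendor ID 1002.

      # Is an amdgpu module available for this kernel, and is it loaded?
      lsmod | grep amdgpu
      modinfo amdgpu | head -n 5
      sudo modprobe amdgpu; sudo dmesg | grep -i amdgpu | tail -n 20

      # Which kernel driver, if any, is bound to the MI300X VFs listed by lspci?
      lspci -nnk -d 1002: | grep -E 'Processing|Kernel driver in use|Kernel modules'

      # ROCm user space needs these device nodes to enumerate GPUs.
      ls -l /dev/kfd /dev/dri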

      [azureuser@fzatlouk-rhelai-1 ~]$ VLLM_LOGGING_LEVEL=DEBUG ilab model serve
      INFO 2025-10-13 14:22:19,319 instructlab.model.serve_backend:80: Setting backend_type in the serve config to vllm
      INFO 2025-10-13 14:22:19,332 instructlab.model.serve_backend:86: Using model '/var/home/azureuser/.cache/instructlab/models/granite-3.1-8b-lab-v2.2' with -1 gpu-layers and 4096 max context size.
      INFO 2025-10-13 14:22:19,362 instructlab.model.serve_backend:133: '--gpus' flag used alongside '--tensor-parallel-size' in the vllm_args section of the config file. Using value of the --gpus flag.
      INFO 2025-10-13 14:22:19,554 instructlab.model.backends.vllm:332: vLLM starting up on pid 5 at http://127.0.0.1:8000/v1
      DEBUG 10-13 14:22:30 [__init__.py:28] No plugins for group vllm.platform_plugins found.
      DEBUG 10-13 14:22:30 [__init__.py:34] Checking if TPU platform is available.
      DEBUG 10-13 14:22:30 [__init__.py:44] TPU platform is not available because: No module named 'libtpu'
      DEBUG 10-13 14:22:30 [__init__.py:52] Checking if CUDA platform is available.
      DEBUG 10-13 14:22:30 [__init__.py:76] Exception happens when checking CUDA platform: NVML Shared Library Not Found
      DEBUG 10-13 14:22:30 [__init__.py:93] CUDA platform is not available because: NVML Shared Library Not Found
      DEBUG 10-13 14:22:30 [__init__.py:100] Checking if ROCm platform is available.
      DEBUG 10-13 14:22:30 [__init__.py:109] ROCm platform is not available because no GPU is found.
      DEBUG 10-13 14:22:30 [__init__.py:122] Checking if HPU platform is available.
      DEBUG 10-13 14:22:30 [__init__.py:129] HPU platform is not available because habana_frameworks is not found.
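
      For completeness, the same conclusion can be cross-checked outside vLLM's platform probing; a minimal sketch, assuming the ROCm build of PyTorch used by ilab is importable from this environment and that the rpm database is available in the bootc image:

      # torch.version.hip is None on non-ROCm builds; device_count() should report 8 GPUs on this VM size.
      python3 -c 'import torch; print(torch.version.hip, torch.cuda.is_available(), torch.cuda.device_count())'

      # Confirm which amdgpu/ROCm driver packages the image actually carries.
      rpm -qa | grep -iE 'amdgpu|rocm'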

       

       

              Assignee: Percy Mattsson (rh-ee-pmattsso)
              Reporter: František Zatloukal (fzatlouk@redhat.com)
              Votes: 0
              Watchers: 5
