Bug
Resolution: Won't Do
rhelai-1.5
AIPCC Accelerators 9, AIPCC Accelerators 10, AIPCC Accelerators 11
Note:
I was able to produce this bug only once in several tests. I am not sure whether it is a cloud / hardware issue or an AIPCC software / driver issue.
Test System:
- Deploy RHEL AI v1.5-7 Prod AMD on IBM Cloud.
- Select profile gx3d-208x1792x8mi300x (8 x AMD MI300X)
  Processor: Intel, vCPU: 208, RAM: 1792 GiB, Speed: 200 Gbps, GPU: 8 x AMD MI300X 192 GB
- Deployment date: 2025-05-16
- InstructLab profile is for 8x AMD MI300x (with no config changes made)
Steps and Results
- Ran `rhc connect`, `ilab config init`, `ilab model download`, and `ilab taxonomy diff`.
- Ran `ilab model serve`, then pressed `Ctrl+C` after it showed `INFO: Application startup complete.`
- Ran `ilab model chat`, which timed out starting the vLLM server (120 attempts).
- `dmesg` started showing amdgpu errors, indicating hangs.
- `amd-smi list` listed the GPUs correctly at this point
- `nvtop` hung for minutes with no output before displaying the GPUs, and then didn't refresh (or very rarely refreshed)
- Ran `ilab model chat` and `nvtop` a few more times over the next few hours, with the same issues.
- `dmesg` then showed different errors around the 11981-second timestamp.
- `amd-smi list` then showed no GPUs at all.
- `nvtop` then showed no GPUs at all
- Collected dmesg output (attached)
- The `reboot` command closed my SSH session but never finished stopping the VM.
- Stopped the VM using the IBM Cloud web console.
- Started the VM.
- The VM and its GPUs worked properly and ran our tests. dmesg looked normal (not attached)
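The hang symptoms in the steps above (a tool that produces no output for minutes) could be captured more systematically with a small probe. The following is a minimal sketch, not part of the original test run: the `probe` helper and its status labels are hypothetical, and it only assumes that the command being checked (for example `amd-smi list`, as used above) is on the PATH.

```python
import subprocess


def probe(cmd, timeout_s=30.0):
    """Run cmd and classify the outcome.

    Returns a (status, stdout) pair where status is one of
    'ok', 'failed', 'hung', or 'missing'. These labels are
    illustrative, not from any tool.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The command did not finish within timeout_s seconds.
        return "hung", ""
    except FileNotFoundError:
        # The executable is not installed or not on the PATH.
        return "missing", ""
    status = "ok" if result.returncode == 0 else "failed"
    return status, result.stdout
```

A call such as `probe(["amd-smi", "list"], timeout_s=60.0)` would distinguish a hang from a normal listing. Two caveats: the report does not say what exit code `amd-smi list` returns when no GPUs are found, so the output should still be inspected; and a process stuck in an uninterruptible kernel wait may not respond to the kill that `subprocess.run` issues on timeout, in which case the probe itself can block.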
Examples of errors from dmesg
- Early on, during the first `ilab model chat`, this was shown repeatedly
[ 6450.298587] amdgpu 0000:e7:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
[ 6450.303441] amdgpu 0000:e7:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
- After a few hours
[11981.086069] amdgpu 0000:dd:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
[11981.339523] amdgpu 0000:e7:00.0: amdgpu: qcm fence wait loop timeout expired
[11981.342298] amdgpu 0000:e7:00.0: amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[11981.343406] amdgpu 0000:dd:00.0: amdgpu: qcm fence wait loop timeout expired
[11981.345812] amdgpu 0000:e7:00.0: amdgpu: Failed to evict process queues
[11981.348269] amdgpu 0000:dd:00.0: amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
[11981.355676] amdgpu 0000:dd:00.0: amdgpu: Failed to evict process queues
[11981.355888] traps: ilab[12302] general protection fault ip:7f4570628898 sp:7f45635fd370 error:0 in libc.so.6[7f4570628000+175000]
[11981.387794] amdgpu 0000:e7:00.0: amdgpu: Dumping IP State
[11981.429391] amdgpu 0000:e7:00.0: amdgpu: Dumping IP State Completed
[11981.827176] amdgpu 0000:e7:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_6.2.1.0 test failed (-110)
[11981.831945] [drm:gfx_v9_4_3_xcc_fini [amdgpu]] *ERROR* XCD 6 KCQ disable failed
[11981.861712] amdgpu 0000:e7:00.0: amdgpu: Dumping IP State
[11981.903459] amdgpu 0000:e7:00.0: amdgpu: Dumping IP State Completed
[12008.052533] [drm:gfx_v9_4_3_xcc_fini [amdgpu]] *ERROR* XCD 7 KCQ disable failed
[12008.057775] amdgpu 0000:ab:00.0: amdgpu: MODE1 reset
[12008.057775] amdgpu 0000:e7:00.0: amdgpu: MODE1 reset
[12008.057775] amdgpu 0000:a1:00.0: amdgpu: MODE1 reset
[12008.057777] amdgpu 0000:b5:00.0: amdgpu: MODE1 reset
[12008.057779] amdgpu 0000:e7:00.0: amdgpu: GPU mode1 reset
[12008.057780] amdgpu 0000:a1:00.0: amdgpu: GPU mode1 reset
[12008.057784] amdgpu 0000:dd:00.0: amdgpu: MODE1 reset
[12008.057784] amdgpu 0000:d3:00.0: amdgpu: MODE1 reset
[12008.057785] amdgpu 0000:c9:00.0: amdgpu: MODE1 reset
[12008.057785] amdgpu 0000:bf:00.0: amdgpu: MODE1 reset
[12008.057788] amdgpu 0000:dd:00.0: amdgpu: GPU mode1 reset
[12008.057789] amdgpu 0000:d3:00.0: amdgpu: GPU mode1 reset
[12008.057790] amdgpu 0000:c9:00.0: amdgpu: GPU mode1 reset
[12008.057790] amdgpu 0000:bf:00.0: amdgpu: GPU mode1 reset
[12008.059696] amdgpu 0000:ab:00.0: amdgpu: GPU mode1 reset
[12008.061494] amdgpu 0000:b5:00.0: amdgpu: GPU mode1 reset
[12008.087541] amdgpu 0000:dd:00.0: amdgpu: GPU smu mode1 reset
[12008.091155] amdgpu 0000:bf:00.0: amdgpu: GPU smu mode1 reset
[12008.094400] amdgpu 0000:a1:00.0: amdgpu: GPU smu mode1 reset
[12008.100810] amdgpu 0000:b5:00.0: amdgpu: GPU smu mode1 reset
[12008.106841] amdgpu 0000:c9:00.0: amdgpu: GPU smu mode1 reset
[12008.110899] amdgpu 0000:d3:00.0: amdgpu: GPU smu mode1 reset
[12008.115791] amdgpu 0000:e7:00.0: amdgpu: GPU smu mode1 reset
[12008.125470] amdgpu 0000:ab:00.0: amdgpu: GPU smu mode1 reset
[12012.476782] amdgpu 0000:dd:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
[12012.481125] amdgpu 0000:dd:00.0: amdgpu: GPU mode1 reset failed
[12012.483190] [drm] ASIC reset failed with error, -62 for drm dev, 0000:dd:00.0
[12012.485033] amdgpu 0000:c9:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
[12012.491867] amdgpu 0000:c9:00.0: amdgpu: GPU mode1 reset failed
[12012.494003] [drm] ASIC reset failed with error, -62 for drm dev, 0000:c9:00.0
[12012.519547] amdgpu 0000:d3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
[12012.526393] amdgpu 0000:d3:00.0: amdgpu: GPU mode1 reset failed
[12012.528330] amdgpu 0000:e7:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
[12012.528460] [drm] ASIC reset failed with error, -62 for drm dev, 0000:d3:00.0
[12012.532330] amdgpu 0000:e7:00.0: amdgpu: GPU mode1 reset failed
[12012.537299] [drm] ASIC reset failed with error, -62 for drm dev, 0000:e7:00.0
[12012.647789] amdgpu 0000:a1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
[12012.649850] amdgpu 0000:bf:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
[12012.650092] amdgpu 0000:a1:00.0: amdgpu: GPU mode1 reset failed
[12012.653936] amdgpu 0000:bf:00.0: amdgpu: GPU mode1 reset failed
[12012.653938] [drm] ASIC reset failed with error, -62 for drm dev, 0000:bf:00.0
[12012.657820] [drm] ASIC reset failed with error, -62 for drm dev, 0000:a1:00.0
[12033.458321] watchdog: BUG: soft lockup - CPU#130 stuck for 27s! [kworker/u418:5:14062]
[12033.463320] watchdog: BUG: soft lockup - CPU#137 stuck for 27s! [kworker/u418:6:14063]
[12033.593777] Workqueue: events_unbound amdgpu_device_xgmi_reset_func [amdgpu]
[12033.597348] Workqueue: events_unbound amdgpu_device_xgmi_reset_func [amdgpu]
[12033.605321] RIP: 0010:amdgpu_device_rreg.part.0+0x31/0xf0 [amdgpu]
[12033.605822] RIP: 0010:amdgpu_device_rreg.part.0+0x31/0xf0 [amdgpu]
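When triaging a dmesg capture like the one above, it may help to split the text into individual entries and count which PCI devices reported a failed reset. The following is a rough sketch under stated assumptions: the helper names `split_entries` and `reset_failures` are hypothetical, and the matching relies only on the bracketed timestamps and the `GPU mode1 reset failed` message text seen in the excerpt.

```python
import re
from collections import Counter

# A kernel log entry starts with a bracketed timestamp, e.g. "[12008.057775] ".
ENTRY_START = re.compile(r"\[\s*(\d+\.\d+)\]\s")
# PCI address as printed by the amdgpu driver, e.g. "0000:e7:00.0".
PCI_ADDR = re.compile(r"\b([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\b")


def split_entries(text):
    """Split dmesg text (even if entries were run together) into
    (timestamp, message) pairs, normalising internal whitespace."""
    matches = list(ENTRY_START.finditer(text))
    entries = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        message = " ".join(text[m.end():end].split())
        entries.append((float(m.group(1)), message))
    return entries


def reset_failures(entries):
    """Count 'GPU mode1 reset failed' messages per PCI address."""
    counts = Counter()
    for _, message in entries:
        if "GPU mode1 reset failed" in message:
            addr = PCI_ADDR.search(message)
            if addr:
                counts[addr.group(1)] += 1
    return counts
```

Feeding the attached dmesg output through `split_entries` and then `reset_failures` would give a per-device tally, which makes it easier to see whether every GPU or only a subset failed to come back after the reset.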
- relates to
AIPCC-1551 1.5 AMD Azure Bootc fails to recognize MI300x GPUs after upgrading from 1.4
- Closed