  1. AI Platform Core Components
  2. AIPCC-1524

AMD GPUs stopped working entirely on IBM Cloud

    • AIPCC Accelerators 9, AIPCC Accelerators 10, AIPCC Accelerators 11

      Note:

      I was only able to produce this bug once out of several tests. Not sure if it's a cloud / hardware issue or an AIPCC software / driver issue.

      Test System:

      • Deploy RHEL AI v1.5-7 Prod AMD on IBM Cloud.
      • Select profile gx3d-208x1792x8mi300x (8 x AMD MI300X):
          Processor: Intel
          vCPU: 208
          RAM: 1792 GiB
          Speed: 200 Gbps
          GPU: 8 x AMD MI300X 192 GB
      • Deployment date: 2025-05-16
      • InstructLab profile is for 8 x AMD MI300X (with no config changes made)

      Steps and Results

      1. Ran `rhc connect`, `ilab config init`, `ilab model download` and `ilab taxonomy diff`.
      2. Ran `ilab model serve` and pressed `Ctrl+C` after it showed `INFO: Application startup complete.`
      3. Ran `ilab model chat`, and it timed out starting the vLLM server (120 attempts).
      4. `dmesg` started showing amdgpu errors, indicating hangs.
      5. `amd-smi list` listed the GPUs correctly at this point
      6. `nvtop` hung for minutes with no output before displaying the GPUs, and then didn't refresh (or very rarely refreshed)
      7. Ran `ilab model chat` and `nvtop` a few more times over the next few hours, with the same issues.
      8. `dmesg` then showed different errors around the 11981-second timestamp.
      9. `amd-smi list` then showed no GPUs at all.
      10. `nvtop` then showed no GPUs at all
      11. Collected dmesg output (attached)
      12. `reboot` command closed my SSH session but never finished stopping the VM.
      13. Stopped the VM using the IBM Cloud web console.
      14. Started the VM.
      15. The VM and its GPUs worked properly and ran our tests. dmesg looked normal (not attached)
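
Steps 5 to 10 depend on whether a diagnostic tool returns at all. The following is a minimal sketch, assuming Python 3 is available on the VM, of a wrapper that runs a command with a time limit so that a hung tool is recorded instead of blocking the session. The `probe` helper and the 60-second default are hypothetical and were not part of the original test run.

```python
import subprocess
import time


def probe(cmd, timeout_s=60):
    """Run cmd and return (status, elapsed_seconds, stdout).

    status is "ok", "failed" (non-zero exit), "hung" (time limit hit),
    or "missing" (executable not found).
    """
    start = time.monotonic()
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return "hung", time.monotonic() - start, ""
    except FileNotFoundError:
        return "missing", time.monotonic() - start, ""
    status = "ok" if result.returncode == 0 else "failed"
    return status, time.monotonic() - start, result.stdout


if __name__ == "__main__":
    # `amd-smi list` is the command used in the steps above.
    status, elapsed, out = probe(["amd-smi", "list"])
    print(f"amd-smi list: {status} after {elapsed:.1f}s")
    print(out)
```

One caveat: `subprocess.run` kills the child when the time limit expires and then waits for it to exit. A process stuck in uninterruptible kernel sleep may not die, in which case this wrapper can block as well.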

       

      Examples of errors from dmesg

      • Early on during the 1st `ilab model chat`, this was shown repeatedly

       

      [ 6450.298587] amdgpu 0000:e7:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
      [ 6450.303441] amdgpu 0000:e7:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
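
To see which GPU reported an error and when, lines like the two above can be split into timestamp, PCI address, and message. This is a minimal sketch in Python; the regular expression is an assumption based only on the lines quoted in this issue, and `parse_amdgpu_line` is a hypothetical helper name.

```python
import re

# Matches lines such as:
# [ 6450.303441] amdgpu 0000:e7:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+amdgpu\s+"
    r"(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]):\s+"
    r"(?:amdgpu:\s+)?(?P<msg>.*)$"
)


def parse_amdgpu_line(line):
    """Return (timestamp, pci_address, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return float(m.group("ts")), m.group("pci"), m.group("msg").strip()


sample = [
    "[ 6450.298587] amdgpu 0000:e7:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000",
    "[ 6450.303441] amdgpu 0000:e7:00.0: amdgpu: Failed to retrieve enabled ppfeatures!",
]
for line in sample:
    print(parse_amdgpu_line(line))
```

Lines that do not start with an `amdgpu <PCI address>:` prefix, such as the watchdog messages, return `None` and can be handled separately.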
       
      • After a few hours
      [11981.086069] amdgpu 0000:dd:00.0: amdgpu: Failed to retrieve enabled ppfeatures!
      [11981.339523] amdgpu 0000:e7:00.0: amdgpu: qcm fence wait loop timeout expired
      [11981.342298] amdgpu 0000:e7:00.0: amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
      [11981.343406] amdgpu 0000:dd:00.0: amdgpu: qcm fence wait loop timeout expired
      [11981.345812] amdgpu 0000:e7:00.0: amdgpu: Failed to evict process queues
      [11981.348269] amdgpu 0000:dd:00.0: amdgpu: The cp might be in an unrecoverable state due to an unsuccessful queues preemption
      [11981.355676] amdgpu 0000:dd:00.0: amdgpu: Failed to evict process queues
      [11981.355888] traps: ilab[12302] general protection fault ip:7f4570628898 sp:7f45635fd370 error:0 in libc.so.6[7f4570628000+175000]
      [11981.387794] amdgpu 0000:e7:00.0: amdgpu: Dumping IP State
      [11981.429391] amdgpu 0000:e7:00.0: amdgpu: Dumping IP State Completed
      [11981.827176] amdgpu 0000:e7:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_6.2.1.0 test failed (-110)
      [11981.831945] [drm:gfx_v9_4_3_xcc_fini [amdgpu]] *ERROR* XCD 6 KCQ disable failed
      [11981.861712] amdgpu 0000:e7:00.0: amdgpu: Dumping IP State
      [11981.903459] amdgpu 0000:e7:00.0: amdgpu: Dumping IP State Completed
       
      [12008.052533] [drm:gfx_v9_4_3_xcc_fini [amdgpu]] *ERROR* XCD 7 KCQ disable failed
      [12008.057775] amdgpu 0000:ab:00.0: amdgpu: MODE1 reset
      [12008.057775] amdgpu 0000:e7:00.0: amdgpu: MODE1 reset
      [12008.057775] amdgpu 0000:a1:00.0: amdgpu: MODE1 reset
      [12008.057777] amdgpu 0000:b5:00.0: amdgpu: MODE1 reset
      [12008.057779] amdgpu 0000:e7:00.0: amdgpu: GPU mode1 reset
      [12008.057780] amdgpu 0000:a1:00.0: amdgpu: GPU mode1 reset
      [12008.057784] amdgpu 0000:dd:00.0: amdgpu: MODE1 reset
      [12008.057784] amdgpu 0000:d3:00.0: amdgpu: MODE1 reset
      [12008.057785] amdgpu 0000:c9:00.0: amdgpu: MODE1 reset
      [12008.057785] amdgpu 0000:bf:00.0: amdgpu: MODE1 reset
      [12008.057788] amdgpu 0000:dd:00.0: amdgpu: GPU mode1 reset
      [12008.057789] amdgpu 0000:d3:00.0: amdgpu: GPU mode1 reset
      [12008.057790] amdgpu 0000:c9:00.0: amdgpu: GPU mode1 reset
      [12008.057790] amdgpu 0000:bf:00.0: amdgpu: GPU mode1 reset
      [12008.059696] amdgpu 0000:ab:00.0: amdgpu: GPU mode1 reset
      [12008.061494] amdgpu 0000:b5:00.0: amdgpu: GPU mode1 reset
      [12008.087541] amdgpu 0000:dd:00.0: amdgpu: GPU smu mode1 reset
      [12008.091155] amdgpu 0000:bf:00.0: amdgpu: GPU smu mode1 reset
      [12008.094400] amdgpu 0000:a1:00.0: amdgpu: GPU smu mode1 reset
      [12008.100810] amdgpu 0000:b5:00.0: amdgpu: GPU smu mode1 reset
      [12008.106841] amdgpu 0000:c9:00.0: amdgpu: GPU smu mode1 reset
      [12008.110899] amdgpu 0000:d3:00.0: amdgpu: GPU smu mode1 reset
      [12008.115791] amdgpu 0000:e7:00.0: amdgpu: GPU smu mode1 reset
      [12008.125470] amdgpu 0000:ab:00.0: amdgpu: GPU smu mode1 reset
      [12012.476782] amdgpu 0000:dd:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
      [12012.481125] amdgpu 0000:dd:00.0: amdgpu: GPU mode1 reset failed
      [12012.483190] [drm] ASIC reset failed with error, -62 for drm dev, 0000:dd:00.0
      [12012.485033] amdgpu 0000:c9:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
      [12012.491867] amdgpu 0000:c9:00.0: amdgpu: GPU mode1 reset failed 
      [12012.494003] [drm] ASIC reset failed with error, -62 for drm dev, 0000:c9:00.0
      [12012.519547] amdgpu 0000:d3:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
      [12012.526393] amdgpu 0000:d3:00.0: amdgpu: GPU mode1 reset failed
      [12012.528330] amdgpu 0000:e7:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
      [12012.528460] [drm] ASIC reset failed with error, -62 for drm dev, 0000:d3:00.0
      [12012.532330] amdgpu 0000:e7:00.0: amdgpu: GPU mode1 reset failed
      [12012.537299] [drm] ASIC reset failed with error, -62 for drm dev, 0000:e7:00.0
      [12012.647789] amdgpu 0000:a1:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
      [12012.649850] amdgpu 0000:bf:00.0: amdgpu: SMU: I'm not done with your previous command: SMN_C2PMSG_66:0x0000000C SMN_C2PMSG_82:0x00000000
      [12012.650092] amdgpu 0000:a1:00.0: amdgpu: GPU mode1 reset failed 
      [12012.653936] amdgpu 0000:bf:00.0: amdgpu: GPU mode1 reset failed
      [12012.653938] [drm] ASIC reset failed with error, -62 for drm dev, 0000:bf:00.0
      [12012.657820] [drm] ASIC reset failed with error, -62 for drm dev, 0000:a1:00.0
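
The reset failures above can be summarized per device. This is a minimal sketch in Python that lists which PCI addresses logged `GPU mode1 reset failed` and decodes the negative error numbers. The errno names are the standard Linux values (62 is `ETIME`, 110 is `ETIMEDOUT`), hard-coded so the snippet behaves the same on any platform; the function names are hypothetical, and the patterns are assumptions based only on the lines quoted in this issue.

```python
import re

# Linux errno names for the two codes that appear in this log.
LINUX_ERRNO = {
    62: "ETIME (timer expired)",
    110: "ETIMEDOUT (connection timed out)",
}

RESET_FAILED_RE = re.compile(
    r"amdgpu\s+(?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]):"
    r"\s+amdgpu:\s+GPU mode1 reset failed"
)
# Finds codes written as "error, -62" or "(-110)".
ERROR_CODE_RE = re.compile(r"(?:error,\s*|\()(-\d+)")


def failed_reset_devices(lines):
    """Return the sorted PCI addresses that logged a failed mode1 reset."""
    found = set()
    for line in lines:
        m = RESET_FAILED_RE.search(line)
        if m:
            found.add(m.group("pci"))
    return sorted(found)


def decode_error_codes(lines):
    """Map each negative error code found in lines to its Linux errno name."""
    codes = {}
    for line in lines:
        for raw in ERROR_CODE_RE.findall(line):
            codes[int(raw)] = LINUX_ERRNO.get(-int(raw), "unknown")
    return codes


log = [
    "[12012.481125] amdgpu 0000:dd:00.0: amdgpu: GPU mode1 reset failed",
    "[12012.483190] [drm] ASIC reset failed with error, -62 for drm dev, 0000:dd:00.0",
    "[12012.491867] amdgpu 0000:c9:00.0: amdgpu: GPU mode1 reset failed",
    "[11981.827176] amdgpu 0000:e7:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_6.2.1.0 test failed (-110)",
]
print(failed_reset_devices(log))
print(decode_error_codes(log))
```

Both codes are timeout errors, which is consistent with the SMU not answering during the reset. Running the same functions over the full attached dmesg output would show whether all eight devices failed or only some of them.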
      
      [12033.458321] watchdog: BUG: soft lockup - CPU#130 stuck for 27s! [kworker/u418:5:14062]
      [12033.463320] watchdog: BUG: soft lockup - CPU#137 stuck for 27s! [kworker/u418:6:14063]
       
      [12033.593777] Workqueue: events_unbound amdgpu_device_xgmi_reset_func [amdgpu]
      [12033.597348] Workqueue: events_unbound amdgpu_device_xgmi_reset_func [amdgpu]
      
      
      [12033.605321] RIP: 0010:amdgpu_device_rreg.part.0+0x31/0xf0 [amdgpu]
      [12033.605822] RIP: 0010:amdgpu_device_rreg.part.0+0x31/0xf0 [amdgpu]
      
      

              mdepaulo@redhat.com Mike DePaulo
              Frank's Team