Red Hat Enterprise Linux AI: RHELAI-4471

AMD GPU Hangs while Training with InstructLab

    • Critical

      Observation: 

       total length: 109306 num samples 65 - rank: 3 num_loss_counted_tokens: 38502
      [21:53:27] INFO     Optimizer state saved in /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0/optimizer.bin                                                                                                                                                        fsdp_utils.py:231
                 INFO     FSDP Optimizer saved to output dir /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0                                                                                                                                                          accelerator.py:3272
                 INFO     Scheduler state saved in /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0/scheduler.bin                                                                                                                                                     checkpointing.py:122
                 INFO     Sampler state for dataloader 0 saved in /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0/sampler.bin                                                                                                                                        checkpointing.py:139
                 INFO     Random states saved in /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0/random_states_0.pkl                                                                                                                                                 checkpointing.py:170
      Saving training state: {'current_epoch': 0, 'samples_seen': 14558}
      Model state saved in: /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0
      Epoch 0: 100%|██████████| 19/19 [11:53<00:00, 37.57s/it]
       total length: 109303 num samples 16 - rank: 0 num_loss_counted_tokens: 6956
       total length: 109307 num samples 16 - rank: 0 num_loss_counted_tokens: 25582
       total length: 109204 num samples 17 - rank: 0 num_loss_counted_tokens: 33791
       total length: 109284 num samples 14 - rank: 0 num_loss_counted_tokens: 13911
       total length: 109299 num samples 15 - rank: 0 num_loss_counted_tokens: 16610
       total length: 109296 num samples 16 - rank: 0 num_loss_counted_tokens: 29114
       total length: 109245 num samples 16 - rank: 0 num_loss_counted_tokens: 29705
       total length: 109217 num samples 15 - rank: 0 num_loss_counted_tokens: 14151
       total length: 109308 num samples 19 - rank: 0 num_loss_counted_tokens: 36587
       total length: 109297 num samples 17 - rank: 0 num_loss_counted_tokens: 15037
       total length: 109245 num samples 15 - rank: 0 num_loss_counted_tokens: 8638
       total length: 109251 num samples 14 - rank: 0 num_loss_counted_tokens: 8535
       total length: 109261 num samples 13 - rank: 0 num_loss_counted_tokens: 10475
       total length: 109308 num samples 17 - rank: 0 num_loss_counted_tokens: 10561
       total length: 109305 num samples 16 - rank: 0 num_loss_counted_tokens: 9775
       total length: 109291 num samples 16 - rank: 0 num_loss_counted_tokens: 12199
       total length: 109306 num samples 15 - rank: 0 num_loss_counted_tokens: 5160
      HW Exception by GPU node-8 (Agent handle: 0x7fd142697900) reason :GPU Hang
      
      real	41m0.610s
      user	0m1.842s
      sys	0m1.511s
      [azureuser@vshaw-rhelai-
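
When triaging a captured training log like the one above, it can help to extract the hang record and the last batch line logged before it. The following is a minimal sketch, not part of InstructLab: the function name `find_gpu_hang` and both regular expressions are hypothetical and derived only from the log lines shown in this report, so other ROCm or instructlab-training versions may word these lines differently.

```python
import re
from typing import Optional

# Patterns copied from the log format in this report (assumption: the
# wording is stable; adjust them if your log lines differ).
HANG_RE = re.compile(r"HW Exception by GPU node-(\d+) .*?reason\s*:\s*(.+)")
BATCH_RE = re.compile(
    r"total length: (\d+) num samples (\d+) - rank: (\d+) "
    r"num_loss_counted_tokens: (\d+)"
)


def find_gpu_hang(log_text: str) -> Optional[dict]:
    """Return the first HW exception and the last batch line logged before it."""
    last_batch = None
    for line in log_text.splitlines():
        batch = BATCH_RE.search(line)
        if batch:
            total, samples, rank, tokens = map(int, batch.groups())
            last_batch = {
                "total_length": total,
                "num_samples": samples,
                "rank": rank,
                "num_loss_counted_tokens": tokens,
            }
            continue
        hang = HANG_RE.search(line)
        if hang:
            return {
                "node": int(hang.group(1)),
                "reason": hang.group(2).strip(),
                "last_batch": last_batch,
            }
    return None  # no HW exception line found


# Two lines taken from the end of the log above.
sample = (
    " total length: 109306 num samples 15 - rank: 0 num_loss_counted_tokens: 5160\n"
    "HW Exception by GPU node-8 (Agent handle: 0x7fd142697900) reason :GPU Hang\n"
)
result = find_gpu_hang(sample)
print(result)
```

The last batch line is only the most recent one printed to the log; it does not by itself show that this batch triggered the hang.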
      

      Steps to reproduce the behaviour:

      1. Bring up RHEL AI on an Azure VM with 8 AMD MI300X GPUs, then use bootc switch to move to version 1.5.2.
      2. Use the image: registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.2-1750191718
      3. Follow the instructions from this snippet until ilab model chat: https://gitlab.cee.redhat.com/-/snippets/9540
      [azureuser@vshaw-rhelai-1 ~]$ sudo bootc status
      apiVersion: org.containers.bootc/v1alpha1
      kind: BootcHost
      metadata:
        name: host
      spec:
        image:
          image: registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.2-1750191718
          transport: registry
        bootOrder: default
      status:
        staged: null
        booted:
          image:
            image:
              image: registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.2-1750191718
              transport: registry
            version: 9.20250429.0
            timestamp: null
            imageDigest: sha256:1c1556421769a66cb584703febba10ebe23d6af091916eec0d5139a40082228d
          cachedUpdate: null
          incompatible: false
          pinned: false
          store: ostreeContainer
          ostree:
            checksum: 017bf20fd108c7136df3d2a21d4ff573dcfc4136d51b97b52bbd1ee955350702
            deploySerial: 0
        rollback:
          image:
            image:
              image: registry.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.1-1749212837
              transport: registry
            version: 9.20250429.0
            timestamp: null
            imageDigest: sha256:d4c9658b63a4fa3412fa2bbcc5af252e71d5319223f3be0b3fbf24e20c1944ce
          cachedUpdate: null
          incompatible: false
          pinned: false
          store: ostreeContainer
          ostree:
            checksum: ca88d45a43d564f644d47a03e19fbb8dc9e4ce822cf39ac9560743780c70b25d
            deploySerial: 0
        rollbackQueued: false
        type: bootcHost
      
      
      [azureuser@vshaw-rhelai-1 ~]$ ilab model list
      +------------------------------------+---------------------+---------+---------------------------------------------------------------------------+
      | Model Name                         | Last Modified       | Size    | Absolute path                                                             |
      +------------------------------------+---------------------+---------+---------------------------------------------------------------------------+
      | models/granite-3.1-8b-lab-v2.1     | 2025-06-18 15:03:44 | 31.2 GB | /var/home/azureuser/.cache/instructlab/models/granite-3.1-8b-lab-v2.1     |
      | models/granite-3.1-8b-starter-v2.1 | 2025-06-18 15:04:20 | 31.2 GB | /var/home/azureuser/.cache/instructlab/models/granite-3.1-8b-starter-v2.1 |
      | models/mixtral-8x7b-instruct-v0-1  | 2025-06-18 15:04:56 | 87.0 GB | /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1  |
      | models/prometheus-8x7b-v2-0        | 2025-06-18 15:05:32 | 87.0 GB | /var/home/azureuser/.cache/instructlab/models/prometheus-8x7b-v2-0        |
      +------------------------------------+---------------------+---------+---------------------------------------------------------------------------+
      
      
      
      [azureuser@vshaw-rhelai-1 ~]$ amd-smi list
      GPU: 0
          BDF: 0002:00:00.0
          UUID: 00ff74b5-0000-1000-8084-713a2fc8c56f
          KFD_ID: 65402
          NODE_ID: 2
          PARTITION_ID: 0
      
      GPU: 1
          BDF: 0003:00:00.0
          UUID: 84ff74b5-0000-1000-800c-188e6ee2f7e9
          KFD_ID: 27175
          NODE_ID: 3
          PARTITION_ID: 0
      
      GPU: 2
          BDF: 0004:00:00.0
          UUID: 73ff74b5-0000-1000-806e-7aa0bffb6f26
          KFD_ID: 16561
          NODE_ID: 4
          PARTITION_ID: 0
      
      GPU: 3
          BDF: 0005:00:00.0
          UUID: c1ff74b5-0000-1000-80cf-f2e39ad707f4
          KFD_ID: 54764
          NODE_ID: 5
          PARTITION_ID: 0
      
      GPU: 4
          BDF: 0006:00:00.0
          UUID: afff74b5-0000-1000-80a0-cceca0221f4b
          KFD_ID: 10760
          NODE_ID: 6
          PARTITION_ID: 0
      
      GPU: 5
          BDF: 0007:00:00.0
          UUID: 82ff74b5-0000-1000-8054-8e91aeb33171
          KFD_ID: 48981
          NODE_ID: 7
          PARTITION_ID: 0
      
      GPU: 6
          BDF: 0008:00:00.0
          UUID: f8ff74b5-0000-1000-809b-e74b559d0a24
          KFD_ID: 32548
          NODE_ID: 8
          PARTITION_ID: 0
      
      GPU: 7
          BDF: 0009:00:00.0
          UUID: f7ff74b5-0000-1000-805d-726356524b5a
          KFD_ID: 60025
          NODE_ID: 9
          PARTITION_ID: 0
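
The hang message in the observation names "GPU node-8", while the listing above gives a NODE_ID for each GPU. The sketch below parses `amd-smi list` text and looks up a node number. It rests on an assumption this report does not confirm: that the node number in the exception message is the same value `amd-smi list` prints as NODE_ID. The helper names `parse_amd_smi_list` and `gpu_for_node` are hypothetical.

```python
from typing import Optional


def parse_amd_smi_list(text: str) -> list[dict]:
    """Parse `amd-smi list` text output into one dict per GPU."""
    gpus: list[dict] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so BDF values such as
        # 0008:00:00.0 stay intact.
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "GPU":
            gpus.append({"GPU": int(value)})
        elif gpus:
            gpus[-1][key] = value
    return gpus


def gpu_for_node(gpus: list[dict], node_id: int) -> Optional[dict]:
    """Return the GPU whose NODE_ID matches node_id (assumed mapping)."""
    for gpu in gpus:
        if gpu.get("NODE_ID") == str(node_id):
            return gpu
    return None


# Two entries copied from the listing above.
listing = """\
GPU: 5
    BDF: 0007:00:00.0
    UUID: 82ff74b5-0000-1000-8054-8e91aeb33171
    KFD_ID: 48981
    NODE_ID: 7
    PARTITION_ID: 0

GPU: 6
    BDF: 0008:00:00.0
    UUID: f8ff74b5-0000-1000-809b-e74b559d0a24
    KFD_ID: 32548
    NODE_ID: 8
    PARTITION_ID: 0
"""
gpus = parse_amd_smi_list(listing)
match = gpu_for_node(gpus, 8)
print(match)
```

If the assumption holds, node-8 would correspond to GPU 6 at BDF 0008:00:00.0, which narrows where to look in kernel logs. Treat this as a starting point for triage rather than a confirmed identification.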
      
      
      
      [azureuser@vshaw-rhelai-1 ~]$ ilab system info
      Platform:
        sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
        sys.platform: linux
        os.name: posix
        platform.release: 5.14.0-427.65.1.el9_4.x86_64
        platform.machine: x86_64
        platform.node: vshaw-rhelai-1.4-amd-test-westus
        platform.python_version: 3.11.7
        os-release.ID: rhel
        os-release.VERSION_ID: 9.4
        os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
        memory.total: 1820.96 GB
        memory.available: 1784.92 GB
        memory.used: 28.53 GB
      
      InstructLab:
        instructlab.version: 0.26.1
        instructlab-dolomite.version: 0.2.0
        instructlab-eval.version: 0.5.1
        instructlab-quantize.version: 0.1.0
        instructlab-schema.version: 0.4.2
        instructlab-sdg.version: 0.8.3
        instructlab-training.version: 0.10.3
      
      Torch:
        torch.version: 2.6.0
        torch.backends.cpu.capability: AVX512
        torch.version.cuda: None
        torch.version.hip: 6.3.42134-a9a80e791
        torch.cuda.available: True
        torch.backends.cuda.is_built: True
        torch.backends.mps.is_built: False
        torch.backends.mps.is_available: False
        torch.cuda.bf16: True
        torch.cuda.current.device: 0
        torch.cuda.0.name: AMD Radeon Graphics
        torch.cuda.0.free: 191.0 GB
        torch.cuda.0.total: 191.5 GB
        torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.1.name: AMD Radeon Graphics
        torch.cuda.1.free: 191.0 GB
        torch.cuda.1.total: 191.5 GB
        torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.2.name: AMD Radeon Graphics
        torch.cuda.2.free: 191.0 GB
        torch.cuda.2.total: 191.5 GB
        torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.3.name: AMD Radeon Graphics
        torch.cuda.3.free: 191.0 GB
        torch.cuda.3.total: 191.5 GB
        torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.4.name: AMD Radeon Graphics
        torch.cuda.4.free: 191.0 GB
        torch.cuda.4.total: 191.5 GB
        torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.5.name: AMD Radeon Graphics
        torch.cuda.5.free: 191.0 GB
        torch.cuda.5.total: 191.5 GB
        torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.6.name: AMD Radeon Graphics
        torch.cuda.6.free: 191.0 GB
        torch.cuda.6.total: 191.5 GB
        torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
        torch.cuda.7.name: AMD Radeon Graphics
        torch.cuda.7.free: 191.0 GB
        torch.cuda.7.total: 191.5 GB
        torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
      
      llama_cpp_python:
        llama_cpp_python.version: 0.3.6
        llama_cpp_python.supports_gpu_offload: False
      
      
      [azureuser@vshaw-rhelai-1 ~]$ ilab shell
      (app-root) /$ rocm-smi 
      
      
      ============================================ ROCm System Management Interface ============================================
      ====================================================== Concise Info ======================================================
      Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%  
                    (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)                                                  
      ==========================================================================================================================
      0       2     0x74b5,   65402  41.0°C      126.0W    NPS1, SPX, 0        161Mhz  900Mhz  0%   auto  750.0W  0%     0%    
      1       3     0x74b5,   27175  39.0°C      123.0W    NPS1, SPX, 0        155Mhz  900Mhz  0%   auto  750.0W  0%     0%    
      2       4     0x74b5,   16561  38.0°C      126.0W    NPS1, SPX, 0        158Mhz  900Mhz  0%   auto  750.0W  0%     0%    
      3       5     0x74b5,   54764  40.0°C      127.0W    NPS1, SPX, 0        156Mhz  900Mhz  0%   auto  750.0W  0%     0%    
      4       6     0x74b5,   10760  39.0°C      125.0W    NPS1, SPX, 0        159Mhz  900Mhz  0%   auto  750.0W  0%     0%    
      5       7     0x74b5,   48981  41.0°C      125.0W    NPS1, SPX, 0        154Mhz  900Mhz  0%   auto  750.0W  0%     0%    
      6       8     0x74b5,   32548  40.0°C      126.0W    NPS1, SPX, 0        153Mhz  900Mhz  0%   auto  750.0W  0%     0%    
      7       9     0x74b5,   60025  39.0°C      125.0W    NPS1, SPX, 0        151Mhz  900Mhz  0%   auto  750.0W  0%     0%    
      ==========================================================================================================================
      
      

              prarit@redhat.com Prarit Bhargava
              rh-ee-vshaw Vikash Shaw