Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: None
Affects Version/s: rhelai-1.5.2
Component/s: Accelerators - AMD, InstructLab - SDG, InstructLab - Training, RHELAI - Azure, SDG/Data
Labels: None
Blocked: False
Blocked Reason: None
Ready: False
Intelligence Requested:
Market:
Severity: Critical

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Observation:

 total length: 109306 num samples 65 - rank: 3 num_loss_counted_tokens: 38502
[21:53:27] INFO     Optimizer state saved in /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0/optimizer.bin                                                                                                                                                        fsdp_utils.py:231
           INFO     FSDP Optimizer saved to output dir /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0                                                                                                                                                          accelerator.py:3272
           INFO     Scheduler state saved in /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0/scheduler.bin                                                                                                                                                     checkpointing.py:122
           INFO     Sampler state for dataloader 0 saved in /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0/sampler.bin                                                                                                                                        checkpointing.py:139
           INFO     Random states saved in /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0/random_states_0.pkl                                                                                                                                                 checkpointing.py:170
Saving training state: {'current_epoch': 0, 'samples_seen': 14558}
Model state saved in: /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0
Epoch 0: 100%|██████████| 19/19 [11:53<00:00, 37.57s/it]
 total length: 109303 num samples 16 - rank: 0 num_loss_counted_tokens: 6956
 total length: 109307 num samples 16 - rank: 0 num_loss_counted_tokens: 25582
 total length: 109204 num samples 17 - rank: 0 num_loss_counted_tokens: 33791
 total length: 109284 num samples 14 - rank: 0 num_loss_counted_tokens: 13911
 total length: 109299 num samples 15 - rank: 0 num_loss_counted_tokens: 16610
 total length: 109296 num samples 16 - rank: 0 num_loss_counted_tokens: 29114
 total length: 109245 num samples 16 - rank: 0 num_loss_counted_tokens: 29705
 total length: 109217 num samples 15 - rank: 0 num_loss_counted_tokens: 14151
 total length: 109308 num samples 19 - rank: 0 num_loss_counted_tokens: 36587
 total length: 109297 num samples 17 - rank: 0 num_loss_counted_tokens: 15037
 total length: 109245 num samples 15 - rank: 0 num_loss_counted_tokens: 8638
 total length: 109251 num samples 14 - rank: 0 num_loss_counted_tokens: 8535
 total length: 109261 num samples 13 - rank: 0 num_loss_counted_tokens: 10475
 total length: 109308 num samples 17 - rank: 0 num_loss_counted_tokens: 10561
 total length: 109305 num samples 16 - rank: 0 num_loss_counted_tokens: 9775
 total length: 109291 num samples 16 - rank: 0 num_loss_counted_tokens: 12199
 total length: 109306 num samples 15 - rank: 0 num_loss_counted_tokens: 5160
HW Exception by GPU node-8 (Agent handle: 0x7fd142697900) reason :GPU Hang

real	41m0.610s
user	0m1.842s
sys	0m1.511s
[azureuser@vshaw-rhelai-
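
When triaging a log like the one above, it can help to count how many batches each rank reported before the hang and to pull out the hardware exception line. The following is a minimal sketch, not part of InstructLab; the function name `summarize` and both regular expressions are illustrative assumptions based only on the line formats shown in this report.

```python
import re
from collections import defaultdict

# Per-batch lines printed during training, e.g.
# " total length: 109306 num samples 65 - rank: 3 num_loss_counted_tokens: 38502"
BATCH_RE = re.compile(
    r"total length:\s*(\d+)\s+num samples\s+(\d+)\s+-\s+rank:\s*(\d+)\s+"
    r"num_loss_counted_tokens:\s*(\d+)"
)
# Hardware exception line as it appears in this report, e.g.
# "HW Exception by GPU node-8 (Agent handle: 0x...) reason :GPU Hang"
HANG_RE = re.compile(r"HW Exception by GPU node-(\d+) .*reason\s*:\s*(.+)")


def summarize(log_text):
    """Return (batch count per rank, list of (node_id, reason) exceptions)."""
    batches = defaultdict(int)
    exceptions = []
    for line in log_text.splitlines():
        m = BATCH_RE.search(line)
        if m:
            batches[int(m.group(3))] += 1
            continue
        m = HANG_RE.search(line)
        if m:
            exceptions.append((int(m.group(1)), m.group(2).strip()))
    return dict(batches), exceptions


# A few lines copied from the observation above.
sample = """\
 total length: 109306 num samples 65 - rank: 3 num_loss_counted_tokens: 38502
 total length: 109303 num samples 16 - rank: 0 num_loss_counted_tokens: 6956
 total length: 109307 num samples 16 - rank: 0 num_loss_counted_tokens: 25582
HW Exception by GPU node-8 (Agent handle: 0x7fd142697900) reason :GPU Hang
"""
batches, exceptions = summarize(sample)
print(batches)
print(exceptions)
```

Run against the full log, this shows how far into the epoch that follows the epoch_0 checkpoint rank 0 got before the GPU hang was reported.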

Steps to reproduce the behaviour:

Bring up RHEL AI on an Azure VM with 8 AMD MI300X GPUs and move it to 1.5.2 with bootc switch.
Use the image registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.2-1750191718
Follow the instructions in this snippet up to the ilab model chat step: https://gitlab.cee.redhat.com/-/snippets/9540

[azureuser@vshaw-rhelai-1 ~]$ sudo bootc status
apiVersion: org.containers.bootc/v1alpha1
kind: BootcHost
metadata:
  name: host
spec:
  image:
    image: registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.2-1750191718
    transport: registry
  bootOrder: default
status:
  staged: null
  booted:
    image:
      image:
        image: registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.2-1750191718
        transport: registry
      version: 9.20250429.0
      timestamp: null
      imageDigest: sha256:1c1556421769a66cb584703febba10ebe23d6af091916eec0d5139a40082228d
    cachedUpdate: null
    incompatible: false
    pinned: false
    store: ostreeContainer
    ostree:
      checksum: 017bf20fd108c7136df3d2a21d4ff573dcfc4136d51b97b52bbd1ee955350702
      deploySerial: 0
  rollback:
    image:
      image:
        image: registry.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.1-1749212837
        transport: registry
      version: 9.20250429.0
      timestamp: null
      imageDigest: sha256:d4c9658b63a4fa3412fa2bbcc5af252e71d5319223f3be0b3fbf24e20c1944ce
    cachedUpdate: null
    incompatible: false
    pinned: false
    store: ostreeContainer
    ostree:
      checksum: ca88d45a43d564f644d47a03e19fbb8dc9e4ce822cf39ac9560743780c70b25d
      deploySerial: 0
  rollbackQueued: false
  type: bootcHost

[azureuser@vshaw-rhelai-1 ~]$ ilab model list
+------------------------------------+---------------------+---------+---------------------------------------------------------------------------+
| Model Name                         | Last Modified       | Size    | Absolute path                                                             |
+------------------------------------+---------------------+---------+---------------------------------------------------------------------------+
| models/granite-3.1-8b-lab-v2.1     | 2025-06-18 15:03:44 | 31.2 GB | /var/home/azureuser/.cache/instructlab/models/granite-3.1-8b-lab-v2.1     |
| models/granite-3.1-8b-starter-v2.1 | 2025-06-18 15:04:20 | 31.2 GB | /var/home/azureuser/.cache/instructlab/models/granite-3.1-8b-starter-v2.1 |
| models/mixtral-8x7b-instruct-v0-1  | 2025-06-18 15:04:56 | 87.0 GB | /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1  |
| models/prometheus-8x7b-v2-0        | 2025-06-18 15:05:32 | 87.0 GB | /var/home/azureuser/.cache/instructlab/models/prometheus-8x7b-v2-0        |
+------------------------------------+---------------------+---------+---------------------------------------------------------------------------+

[azureuser@vshaw-rhelai-1 ~]$ amd-smi list
GPU: 0
    BDF: 0002:00:00.0
    UUID: 00ff74b5-0000-1000-8084-713a2fc8c56f
    KFD_ID: 65402
    NODE_ID: 2
    PARTITION_ID: 0

GPU: 1
    BDF: 0003:00:00.0
    UUID: 84ff74b5-0000-1000-800c-188e6ee2f7e9
    KFD_ID: 27175
    NODE_ID: 3
    PARTITION_ID: 0

GPU: 2
    BDF: 0004:00:00.0
    UUID: 73ff74b5-0000-1000-806e-7aa0bffb6f26
    KFD_ID: 16561
    NODE_ID: 4
    PARTITION_ID: 0

GPU: 3
    BDF: 0005:00:00.0
    UUID: c1ff74b5-0000-1000-80cf-f2e39ad707f4
    KFD_ID: 54764
    NODE_ID: 5
    PARTITION_ID: 0

GPU: 4
    BDF: 0006:00:00.0
    UUID: afff74b5-0000-1000-80a0-cceca0221f4b
    KFD_ID: 10760
    NODE_ID: 6
    PARTITION_ID: 0

GPU: 5
    BDF: 0007:00:00.0
    UUID: 82ff74b5-0000-1000-8054-8e91aeb33171
    KFD_ID: 48981
    NODE_ID: 7
    PARTITION_ID: 0

GPU: 6
    BDF: 0008:00:00.0
    UUID: f8ff74b5-0000-1000-809b-e74b559d0a24
    KFD_ID: 32548
    NODE_ID: 8
    PARTITION_ID: 0

GPU: 7
    BDF: 0009:00:00.0
    UUID: f7ff74b5-0000-1000-805d-726356524b5a
    KFD_ID: 60025
    NODE_ID: 9
    PARTITION_ID: 0
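
To relate the "GPU node-8" in the HW Exception line to a physical device, the `amd-smi list` output above can be turned into a lookup table. The sketch below assumes that the node number in the exception message is the same identifier that `amd-smi list` prints as NODE_ID; that correspondence is an assumption, not something this report confirms. The helper name `node_to_gpu` is hypothetical.

```python
import re


def node_to_gpu(amd_smi_list_output):
    """Map NODE_ID -> (GPU index, BDF) from `amd-smi list` text output."""
    mapping = {}
    gpu = bdf = None
    for line in amd_smi_list_output.splitlines():
        line = line.strip()
        if m := re.fullmatch(r"GPU:\s*(\d+)", line):
            # A new GPU stanza starts; forget the previous BDF.
            gpu, bdf = int(m.group(1)), None
        elif m := re.fullmatch(r"BDF:\s*(\S+)", line):
            bdf = m.group(1)
        elif (m := re.fullmatch(r"NODE_ID:\s*(\d+)", line)) and gpu is not None:
            mapping[int(m.group(1))] = (gpu, bdf)
    return mapping


# Two stanzas copied from the output above.
sample = """\
GPU: 5
    BDF: 0007:00:00.0
    UUID: 82ff74b5-0000-1000-8054-8e91aeb33171
    KFD_ID: 48981
    NODE_ID: 7
    PARTITION_ID: 0

GPU: 6
    BDF: 0008:00:00.0
    UUID: f8ff74b5-0000-1000-809b-e74b559d0a24
    KFD_ID: 32548
    NODE_ID: 8
    PARTITION_ID: 0
"""
print(node_to_gpu(sample)[8])
```

Under that assumption, node 8 resolves to GPU 6 at BDF 0008:00:00.0, which matches the Device 6 / Node 8 row in the rocm-smi table later in this report.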

[azureuser@vshaw-rhelai-1 ~]$ ilab system info
Platform:
  sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.65.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: vshaw-rhelai-1.4-amd-test-westus
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1820.96 GB
  memory.available: 1784.92 GB
  memory.used: 28.53 GB

InstructLab:
  instructlab.version: 0.26.1
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.8.3
  instructlab-training.version: 0.10.3

Torch:
  torch.version: 2.6.0
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: None
  torch.version.hip: 6.3.42134-a9a80e791
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: AMD Radeon Graphics
  torch.cuda.0.free: 191.0 GB
  torch.cuda.0.total: 191.5 GB
  torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: AMD Radeon Graphics
  torch.cuda.1.free: 191.0 GB
  torch.cuda.1.total: 191.5 GB
  torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: AMD Radeon Graphics
  torch.cuda.2.free: 191.0 GB
  torch.cuda.2.total: 191.5 GB
  torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: AMD Radeon Graphics
  torch.cuda.3.free: 191.0 GB
  torch.cuda.3.total: 191.5 GB
  torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.4.name: AMD Radeon Graphics
  torch.cuda.4.free: 191.0 GB
  torch.cuda.4.total: 191.5 GB
  torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.5.name: AMD Radeon Graphics
  torch.cuda.5.free: 191.0 GB
  torch.cuda.5.total: 191.5 GB
  torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.6.name: AMD Radeon Graphics
  torch.cuda.6.free: 191.0 GB
  torch.cuda.6.total: 191.5 GB
  torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.7.name: AMD Radeon Graphics
  torch.cuda.7.free: 191.0 GB
  torch.cuda.7.total: 191.5 GB
  torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)

llama_cpp_python:
  llama_cpp_python.version: 0.3.6
  llama_cpp_python.supports_gpu_offload: False

[azureuser@vshaw-rhelai-1 ~]$ ilab shell
(app-root) /$ rocm-smi 


============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp        Power     Partitions          SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%  
              (DID,     GUID)  (Junction)  (Socket)  (Mem, Compute, ID)                                                  
==========================================================================================================================
0       2     0x74b5,   65402  41.0°C      126.0W    NPS1, SPX, 0        161Mhz  900Mhz  0%   auto  750.0W  0%     0%    
1       3     0x74b5,   27175  39.0°C      123.0W    NPS1, SPX, 0        155Mhz  900Mhz  0%   auto  750.0W  0%     0%    
2       4     0x74b5,   16561  38.0°C      126.0W    NPS1, SPX, 0        158Mhz  900Mhz  0%   auto  750.0W  0%     0%    
3       5     0x74b5,   54764  40.0°C      127.0W    NPS1, SPX, 0        156Mhz  900Mhz  0%   auto  750.0W  0%     0%    
4       6     0x74b5,   10760  39.0°C      125.0W    NPS1, SPX, 0        159Mhz  900Mhz  0%   auto  750.0W  0%     0%    
5       7     0x74b5,   48981  41.0°C      125.0W    NPS1, SPX, 0        154Mhz  900Mhz  0%   auto  750.0W  0%     0%    
6       8     0x74b5,   32548  40.0°C      126.0W    NPS1, SPX, 0        153Mhz  900Mhz  0%   auto  750.0W  0%     0%    
7       9     0x74b5,   60025  39.0°C      125.0W    NPS1, SPX, 0        151Mhz  900Mhz  0%   auto  750.0W  0%     0%    
==========================================================================================================================

Assignee: Prarit Bhargava
Reporter: Vikash Shaw
Votes: 0
Watchers: 2
Created: 2025/06/18 10:24 PM
Updated: 2025/07/08 1:33 PM
Resolved: 2025/06/20 2:34 PM
