Type: Bug
Resolution: Done
Priority: Critical
Version: rhelai-1.5.2
Severity: Critical
Observation:
Phase2 training completed epoch 0 and saved a full checkpoint, then the run crashed with a GPU hang roughly 41 minutes in:

total length: 109306 num samples 65 - rank: 3 num_loss_counted_tokens: 38502
[21:53:27] INFO Optimizer state saved in /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0/optimizer.bin fsdp_utils.py:231
INFO FSDP Optimizer saved to output dir /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0 accelerator.py:3272
INFO Scheduler state saved in /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0/scheduler.bin checkpointing.py:122
INFO Sampler state for dataloader 0 saved in /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0/sampler.bin checkpointing.py:139
INFO Random states saved in /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0/random_states_0.pkl checkpointing.py:170
Saving training state: {'current_epoch': 0, 'samples_seen': 14558}
Model state saved in: /var/home/azureuser/.local/share/instructlab/phased/phase2/checkpoints/full_state/epoch_0
Epoch 0: 100%|██████████| 19/19 [11:53<00:00, 37.57s/it]
total length: 109303 num samples 16 - rank: 0 num_loss_counted_tokens: 6956
total length: 109307 num samples 16 - rank: 0 num_loss_counted_tokens: 25582
total length: 109204 num samples 17 - rank: 0 num_loss_counted_tokens: 33791
total length: 109284 num samples 14 - rank: 0 num_loss_counted_tokens: 13911
total length: 109299 num samples 15 - rank: 0 num_loss_counted_tokens: 16610
total length: 109296 num samples 16 - rank: 0 num_loss_counted_tokens: 29114
total length: 109245 num samples 16 - rank: 0 num_loss_counted_tokens: 29705
total length: 109217 num samples 15 - rank: 0 num_loss_counted_tokens: 14151
total length: 109308 num samples 19 - rank: 0 num_loss_counted_tokens: 36587
total length: 109297 num samples 17 - rank: 0 num_loss_counted_tokens: 15037
total length: 109245 num samples 15 - rank: 0 num_loss_counted_tokens: 8638
total length: 109251 num samples 14 - rank: 0 num_loss_counted_tokens: 8535
total length: 109261 num samples 13 - rank: 0 num_loss_counted_tokens: 10475
total length: 109308 num samples 17 - rank: 0 num_loss_counted_tokens: 10561
total length: 109305 num samples 16 - rank: 0 num_loss_counted_tokens: 9775
total length: 109291 num samples 16 - rank: 0 num_loss_counted_tokens: 12199
total length: 109306 num samples 15 - rank: 0 num_loss_counted_tokens: 5160
HW Exception by GPU node-8 (Agent handle: 0x7fd142697900) reason :GPU Hang

real 41m0.610s
user 0m1.842s
sys 0m1.511s
Steps to reproduce the behaviour:
- Bring up RHEL AI 1.5.2 on an Azure instance with 8 AMD MI300X GPUs via bootc switch
- Use: registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.2-1750191718
- Follow the instructions from this snippet up to ilab model chat: https://gitlab.cee.redhat.com/-/snippets/9540
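The steps above can be condensed into a command sketch. The image reference and snippet URL come from this report; the ilab commands are the standard InstructLab workflow, and the exact flags and phases follow the linked snippet, not this sketch:

```shell
# Switch the bootc host to the 1.5.2 stage image and reboot into it
sudo bootc switch registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.2-1750191718
sudo systemctl reboot

# After reboot, confirm the booted image, then run the workflow from the snippet
sudo bootc status
ilab config init
ilab model download
ilab data generate
ilab model train   # phased training; the GPU hang occurred during phase2
```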
[azureuser@vshaw-rhelai-1 ~]$ sudo bootc status
apiVersion: org.containers.bootc/v1alpha1
kind: BootcHost
metadata:
  name: host
spec:
  image:
    image: registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.2-1750191718
    transport: registry
  bootOrder: default
status:
  staged: null
  booted:
    image:
      image:
        image: registry.stage.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.2-1750191718
        transport: registry
      version: 9.20250429.0
      timestamp: null
      imageDigest: sha256:1c1556421769a66cb584703febba10ebe23d6af091916eec0d5139a40082228d
    cachedUpdate: null
    incompatible: false
    pinned: false
    store: ostreeContainer
    ostree:
      checksum: 017bf20fd108c7136df3d2a21d4ff573dcfc4136d51b97b52bbd1ee955350702
      deploySerial: 0
  rollback:
    image:
      image:
        image: registry.redhat.io/rhelai1/bootc-azure-amd-rhel9:1.5.1-1749212837
        transport: registry
      version: 9.20250429.0
      timestamp: null
      imageDigest: sha256:d4c9658b63a4fa3412fa2bbcc5af252e71d5319223f3be0b3fbf24e20c1944ce
    cachedUpdate: null
    incompatible: false
    pinned: false
    store: ostreeContainer
    ostree:
      checksum: ca88d45a43d564f644d47a03e19fbb8dc9e4ce822cf39ac9560743780c70b25d
      deploySerial: 0
  rollbackQueued: false
  type: bootcHost
[azureuser@vshaw-rhelai-1 ~]$ ilab model list
+------------------------------------+---------------------+---------+---------------------------------------------------------------------------+
| Model Name                         | Last Modified       | Size    | Absolute path                                                             |
+------------------------------------+---------------------+---------+---------------------------------------------------------------------------+
| models/granite-3.1-8b-lab-v2.1     | 2025-06-18 15:03:44 | 31.2 GB | /var/home/azureuser/.cache/instructlab/models/granite-3.1-8b-lab-v2.1     |
| models/granite-3.1-8b-starter-v2.1 | 2025-06-18 15:04:20 | 31.2 GB | /var/home/azureuser/.cache/instructlab/models/granite-3.1-8b-starter-v2.1 |
| models/mixtral-8x7b-instruct-v0-1  | 2025-06-18 15:04:56 | 87.0 GB | /var/home/azureuser/.cache/instructlab/models/mixtral-8x7b-instruct-v0-1  |
| models/prometheus-8x7b-v2-0        | 2025-06-18 15:05:32 | 87.0 GB | /var/home/azureuser/.cache/instructlab/models/prometheus-8x7b-v2-0        |
+------------------------------------+---------------------+---------+---------------------------------------------------------------------------+
[azureuser@vshaw-rhelai-1 ~]$ amd-smi list
GPU: 0
    BDF: 0002:00:00.0
    UUID: 00ff74b5-0000-1000-8084-713a2fc8c56f
    KFD_ID: 65402
    NODE_ID: 2
    PARTITION_ID: 0
GPU: 1
    BDF: 0003:00:00.0
    UUID: 84ff74b5-0000-1000-800c-188e6ee2f7e9
    KFD_ID: 27175
    NODE_ID: 3
    PARTITION_ID: 0
GPU: 2
    BDF: 0004:00:00.0
    UUID: 73ff74b5-0000-1000-806e-7aa0bffb6f26
    KFD_ID: 16561
    NODE_ID: 4
    PARTITION_ID: 0
GPU: 3
    BDF: 0005:00:00.0
    UUID: c1ff74b5-0000-1000-80cf-f2e39ad707f4
    KFD_ID: 54764
    NODE_ID: 5
    PARTITION_ID: 0
GPU: 4
    BDF: 0006:00:00.0
    UUID: afff74b5-0000-1000-80a0-cceca0221f4b
    KFD_ID: 10760
    NODE_ID: 6
    PARTITION_ID: 0
GPU: 5
    BDF: 0007:00:00.0
    UUID: 82ff74b5-0000-1000-8054-8e91aeb33171
    KFD_ID: 48981
    NODE_ID: 7
    PARTITION_ID: 0
GPU: 6
    BDF: 0008:00:00.0
    UUID: f8ff74b5-0000-1000-809b-e74b559d0a24
    KFD_ID: 32548
    NODE_ID: 8
    PARTITION_ID: 0
GPU: 7
    BDF: 0009:00:00.0
    UUID: f7ff74b5-0000-1000-805d-726356524b5a
    KFD_ID: 60025
    NODE_ID: 9
    PARTITION_ID: 0
[azureuser@vshaw-rhelai-1 ~]$ ilab system info
Platform:
  sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.65.1.el9_4.x86_64
  platform.machine: x86_64
  platform.node: vshaw-rhelai-1.4-amd-test-westus
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1820.96 GB
  memory.available: 1784.92 GB
  memory.used: 28.53 GB

InstructLab:
  instructlab.version: 0.26.1
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.8.3
  instructlab-training.version: 0.10.3

Torch:
  torch.version: 2.6.0
  torch.backends.cpu.capability: AVX512
  torch.version.cuda: None
  torch.version.hip: 6.3.42134-a9a80e791
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: AMD Radeon Graphics
  torch.cuda.0.free: 191.0 GB
  torch.cuda.0.total: 191.5 GB
  torch.cuda.0.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: AMD Radeon Graphics
  torch.cuda.1.free: 191.0 GB
  torch.cuda.1.total: 191.5 GB
  torch.cuda.1.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: AMD Radeon Graphics
  torch.cuda.2.free: 191.0 GB
  torch.cuda.2.total: 191.5 GB
  torch.cuda.2.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: AMD Radeon Graphics
  torch.cuda.3.free: 191.0 GB
  torch.cuda.3.total: 191.5 GB
  torch.cuda.3.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.4.name: AMD Radeon Graphics
  torch.cuda.4.free: 191.0 GB
  torch.cuda.4.total: 191.5 GB
  torch.cuda.4.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.5.name: AMD Radeon Graphics
  torch.cuda.5.free: 191.0 GB
  torch.cuda.5.total: 191.5 GB
  torch.cuda.5.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.6.name: AMD Radeon Graphics
  torch.cuda.6.free: 191.0 GB
  torch.cuda.6.total: 191.5 GB
  torch.cuda.6.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.7.name: AMD Radeon Graphics
  torch.cuda.7.free: 191.0 GB
  torch.cuda.7.total: 191.5 GB
  torch.cuda.7.capability: 9.4 (see https://developer.nvidia.com/cuda-gpus#compute)

llama_cpp_python:
  llama_cpp_python.version: 0.3.6
  llama_cpp_python.supports_gpu_offload: False
[azureuser@vshaw-rhelai-1 ~]$ ilab shell
(app-root) /$ rocm-smi
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device Node IDs Temp Power Partitions SCLK MCLK Fan Perf PwrCap VRAM% GPU%
(DID, GUID) (Junction) (Socket) (Mem, Compute, ID)
==========================================================================================================================
0 2 0x74b5, 65402 41.0°C 126.0W NPS1, SPX, 0 161Mhz 900Mhz 0% auto 750.0W 0% 0%
1 3 0x74b5, 27175 39.0°C 123.0W NPS1, SPX, 0 155Mhz 900Mhz 0% auto 750.0W 0% 0%
2 4 0x74b5, 16561 38.0°C 126.0W NPS1, SPX, 0 158Mhz 900Mhz 0% auto 750.0W 0% 0%
3 5 0x74b5, 54764 40.0°C 127.0W NPS1, SPX, 0 156Mhz 900Mhz 0% auto 750.0W 0% 0%
4 6 0x74b5, 10760 39.0°C 125.0W NPS1, SPX, 0 159Mhz 900Mhz 0% auto 750.0W 0% 0%
5 7 0x74b5, 48981 41.0°C 125.0W NPS1, SPX, 0 154Mhz 900Mhz 0% auto 750.0W 0% 0%
6 8 0x74b5, 32548 40.0°C 126.0W NPS1, SPX, 0 153Mhz 900Mhz 0% auto 750.0W 0% 0%
7 9 0x74b5, 60025 39.0°C 125.0W NPS1, SPX, 0 151Mhz 900Mhz 0% auto 750.0W 0% 0%
==========================================================================================================================
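When the hang reproduces, the kernel log usually carries more detail than the training output above. A minimal triage sketch using standard RHEL and ROCm tooling (not taken from this report; commands assume sudo access on the affected host):

```shell
# Look for amdgpu/KFD hang and reset messages in the kernel ring buffer
sudo dmesg | grep -iE 'amdgpu|gpu hang|kfd'

# Persisted kernel messages from the affected boot
sudo journalctl -k -b | grep -i amdgpu

# Per-GPU state after the hang
amd-smi metric
```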