- Bug
- Resolution: Done
- Undefined
- rhelai-1.5
- None
- False
- False
- Critical
An attempt to run training on AMD with RHEL AI 1.5-6 fails with the fix from https://issues.redhat.com/browse/RHELAI-4128 applied. Testing was done on Azure on MI300X.
The executed command was:
time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2 | tee iso-testrun/ilab-train
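For readability, the same invocation can be written with the dataset run directory resolved once (an equivalent sketch, not the literal command that was executed; note that the second backtick expansion above uses the relative path .local/share/instructlab/datasets/, so it only resolves correctly when the command is run from the home directory):

# Hedged, equivalent restatement of the reproducer above.
DATASETS=~/.local/share/instructlab/datasets
RUN_DIR="$DATASETS/$(ls -1 "$DATASETS" | head -n1)"
time ilab model train -y \
  --force-clear-phased-cache \
  --enable-serving-output \
  --strategy lab-multiphase \
  --phased-phase1-data "$RUN_DIR"/knowledge_train_msgs_*.jsonl \
  --phased-phase2-data "$RUN_DIR"/skills_train_msgs_reduced.jsonl \
  --phased-phase1-num-epochs 2 \
  --phased-phase2-num-epochs 2 | tee iso-testrun/ilab-train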
The relevant snippet is:
Epoch: 0, Step: 8, Rank: 2, loss = 0.4657960534095764
Epoch: 0, Step: 8, Rank: 1, loss = 0.524458646774292
Epoch: 0, Step: 8, Rank: 0, loss = 0.344452440738678
Epoch: 0, Step: 8, Rank: 3, loss = 0.5397558808326721
Epoch: 0, Step: 8, Rank: 4, loss = 0.7030256390571594
Epoch: 0, Step: 8, Rank: 6, loss = 0.8349191546440125
Epoch: 0, Step: 8, Rank: 5, loss = 0.8995961546897888
Epoch: 0, Step: 8, Rank: 7, loss = 0.6992480158805847
[rank2]: Traceback (most recent call last):
[rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 1013, in <module>
[rank2]:     main(args)
[rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 665, in main
[rank2]:     train(
[rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 437, in train
[rank2]:     accelerator.backward(loss)
[rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 2454, in backward
[rank2]:     loss.backward(**kwargs)
[rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/_tensor.py", line 626, in backward
[rank2]:     torch.autograd.backward(
[rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank2]:     _engine_run_backward(
[rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank2]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 24.93 GiB. GPU 2 has a total capacity of 191.45 GiB of which 21.64 GiB is free. Of the allocated memory 123.73 GiB is allocated by PyTorch, and 39.76 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
[rank5]: Traceback (most recent call last):
[rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 1013, in <module>
[rank5]:     main(args)
[rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 665, in main
[rank5]:     train(
[rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 437, in train
[rank5]:     accelerator.backward(loss)
[rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 2454, in backward
[rank5]:     loss.backward(**kwargs)
[rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/_tensor.py", line 626, in backward
[rank5]:     torch.autograd.backward(
[rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank5]:     _engine_run_backward(
[rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank5]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank5]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 24.93 GiB. GPU 5 has a total capacity of 191.45 GiB of which 21.77 GiB is free. Of the allocated memory 123.73 GiB is allocated by PyTorch, and 39.76 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
W0513 13:41:06.536000 1963 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1965 closing signal SIGTERM
W0513 13:41:06.538000 1963 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1966 closing signal SIGTERM
W0513 13:41:06.538000 1963 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1968 closing signal SIGTERM
W0513 13:41:06.540000 1963 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1969 closing signal SIGTERM
W0513 13:41:06.541000 1963 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1971 closing signal SIGTERM
W0513 13:41:06.542000 1963 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1972 closing signal SIGTERM
E0513 13:41:07.924000 1963 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 1967) of binary: /opt/app-root/bin/python3.11
Traceback (most recent call last):
  File "/opt/app-root/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-05-13_13:41:06
  host      : fzatlouk-rhelai-1.4-amd-test-westus
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 1970)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-13_13:41:06
  host      : fzatlouk-rhelai-1.4-amd-test-westus
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 1967)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Training subprocess has not exited yet. Sending SIGTERM.
Waiting for process to exit, 60s...
ERROR 2025-05-13 13:41:10,024 instructlab.model.accelerated_train:276: Failed during training loop: Suffered a failure during distributed training. Please see the training logs for more context.
Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 261, in _run_phase
    _training_phase(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 563, in _training_phase
    run_training(train_args=train_args, torch_args=torch_args)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
    return run_training(torch_args=torch_args, train_args=train_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 863, in run_training
    raise RuntimeError(
RuntimeError: Suffered a failure during distributed training. Please see the training logs for more context.
Accelerated Training failed with 1
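The HIP OOM message itself suggests setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to reduce allocator fragmentation. A minimal sketch of re-running the reproducer with that allocator hint exported is below; this is an assumption-based mitigation taken from the error text, not a verified fix for this bug (it assumes the environment variable is inherited by the torchrun worker processes that ilab launches):

# Hedged workaround sketch: export the allocator hint from the OOM message,
# then re-run the same multiphase training command (abbreviated here).
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
time ilab model train -y --force-clear-phased-cache --enable-serving-output \
  --strategy lab-multiphase ...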