- Bug
- Resolution: Done
- Undefined
- rhelai-1.5
- None
- False
- False
- Critical
- Approved
Running training on AMD with RHEL AI 1.5-6 fails. Testing was done on Azure with MI300X GPUs.
The executed command was:
time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2 | tee iso-testrun/ilab-train
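The command above resolves the dataset directory twice with backtick substitutions, once relative to `~` and once relative to the current directory, so the two only agree when it is run from the home directory. A minimal Python sketch of resolving the directory once and building the same argument list is shown here. The helper names `first_dataset_dir` and `build_train_args` are hypothetical. It also assumes that only subdirectories count, that plain name sorting matches what `ls -1 | head -n1` returned, and that the first match of the `knowledge_train_msgs_*.jsonl` glob is the intended file. It only builds the argument list and does not invoke `ilab`.

```python
from pathlib import Path


def first_dataset_dir(datasets_root: Path) -> Path:
    """Return the first subdirectory by name, roughly like `ls -1 | head -n1`.

    Assumption: only directories are considered, and code-point sorting is
    close enough to the locale-aware order that `ls` would use.
    """
    subdirs = sorted(p for p in datasets_root.iterdir() if p.is_dir())
    if not subdirs:
        raise FileNotFoundError(f"no dataset directories under {datasets_root}")
    return subdirs[0]


def build_train_args(datasets_root: Path) -> list[str]:
    """Build the argument list for the `ilab model train` call from the report."""
    run_dir = first_dataset_dir(datasets_root)

    # Expand the glob here instead of relying on the shell.
    # Assumption: the first match is the file that was meant.
    phase1 = sorted(run_dir.glob("knowledge_train_msgs_*.jsonl"))
    if not phase1:
        raise FileNotFoundError(f"no knowledge_train_msgs_*.jsonl in {run_dir}")

    return [
        "ilab", "model", "train", "-y",
        "--force-clear-phased-cache",
        "--enable-serving-output",
        "--strategy", "lab-multiphase",
        "--phased-phase1-data", str(phase1[0]),
        "--phased-phase2-data", str(run_dir / "skills_train_msgs_reduced.jsonl"),
        "--phased-phase1-num-epochs", "2",
        "--phased-phase2-num-epochs", "2",
    ]
```

Calling `build_train_args(Path.home() / ".local/share/instructlab/datasets")` would give a list suitable for `subprocess.run`, with both data paths guaranteed to point into the same dataset directory.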
Full logs will be attached; the relevant snippet is:
[rank4]: Traceback (most recent call last):
[rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 1013, in <module>
[rank4]:     main(args)
[rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 659, in main
[rank4]:     model, lr_scheduler, optimizer, accelerator = setup_model(
[rank4]:                                                   ^^^^^^^^^^^^
[rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 259, in setup_model
[rank4]:     model = accelerator.prepare(model)
[rank4]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 1446, in prepare
[rank4]:     result = tuple(
[rank4]:              ^^^^^^
[rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 1447, in <genexpr>
[rank4]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank4]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 1289, in _prepare_one
[rank4]:     return self.prepare_model(obj, device_placement=device_placement)
[rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 1635, in prepare_model
[rank4]:     fsdp_plugin.param_init_fn = ensure_weights_retied(
[rank4]:                                 ^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/utils/fsdp_utils.py", line 396, in ensure_weights_retied
[rank4]:     mod = model.get_submodule(name)
[rank4]:           ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 720, in get_submodule
[rank4]:     raise AttributeError(
[rank4]: AttributeError: GPTDolomiteForCausalLM has no attribute `lm_head`
<snip>
E0512 14:48:03.894000 646 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 650) of binary: /opt/app-root/bin/python3.11
Traceback (most recent call last):
  File "/opt/app-root/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
  File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/run.py", line 909, in run
    elastic_launch(
  File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2025-05-12_14:48:03
  host      : fzatlouk-rhelai-1.4-amd-test-westus
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 651)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-05-12_14:48:03
  host      : fzatlouk-rhelai-1.4-amd-test-westus
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 650)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Training subprocess has not exited yet. Sending SIGTERM.
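Because the torchrun summary reports only exit codes (`error_file: <N/A>`), the actual exception has to be read from the rank-prefixed lines. The following is a rough sketch of pulling the final exception per rank out of such a log for triage. The function name `last_exception_per_rank` is hypothetical. It assumes the log keeps its original line breaks, that each worker line is prefixed with `[rankN]: `, and that the exception class name ends in `Error` or `Exception`.

```python
import re

# "[rank4]: <payload>" as emitted for each worker line in the log.
RANK_LINE = re.compile(r"^\[rank(\d+)\]: (.*)$")
# A final traceback line such as "AttributeError: message".
# Assumption: the class name ends in "Error" or "Exception".
EXC_LINE = re.compile(r"^([A-Za-z_][\w.]*(?:Error|Exception)): (.*)$")


def last_exception_per_rank(log_text: str) -> dict[int, tuple[str, str]]:
    """Map each rank to the (exception type, message) of its last exception line."""
    errors: dict[int, tuple[str, str]] = {}
    for line in log_text.splitlines():
        rank_match = RANK_LINE.match(line)
        if rank_match is None:
            continue  # not a rank-prefixed line (e.g. the torchrun summary)
        rank = int(rank_match.group(1))
        exc_match = EXC_LINE.match(rank_match.group(2))
        if exc_match is not None:
            # Later matches overwrite earlier ones, so the last one wins.
            errors[rank] = (exc_match.group(1), exc_match.group(2))
    return errors
```

Applied to the snippet in this report, the function would associate rank 4 with the `AttributeError` about `lm_head`. Ranks 2 and 3, which the torchrun summary lists, would need their own `[rank2]:` and `[rank3]:` lines from the full logs.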
- is blocked by PROJQUAY-8925 Quay.io image pushes fail with a 500 Internal Server Error (Closed)