Red Hat Enterprise Linux AI
RHELAI-4128

Training is not working on AMD


    • Type: Bug
    • Resolution: Done
    • Priority: Undefined
    • Affects Version: rhelai-1.5
    • Fix Version: rhelai-1.5
    • Component: Accelerators - AMD
    • Severity: Critical
    • Approved

      An attempt to run training on AMD with RHEL AI 1.5-6 fails. Testing was done on Azure on an MI300X instance.

      The executed command was: 

      time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2 | tee iso-testrun/ilab-train

      Full logs will be attached; the relevant snippet is:

       

      [rank4]: Traceback (most recent call last):
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 1013, in <module>
      [rank4]:     main(args)
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 659, in main
      [rank4]:     model, lr_scheduler, optimizer, accelerator = setup_model(
      [rank4]:                                                   ^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 259, in setup_model
      [rank4]:     model = accelerator.prepare(model)
      [rank4]:             ^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 1446, in prepare
      [rank4]:     result = tuple(
      [rank4]:              ^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 1447, in <genexpr>
      [rank4]:     self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
      [rank4]:     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 1289, in _prepare_one
      [rank4]:     return self.prepare_model(obj, device_placement=device_placement)
      [rank4]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 1635, in prepare_model
      [rank4]:     fsdp_plugin.param_init_fn = ensure_weights_retied(
      [rank4]:                                 ^^^^^^^^^^^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/utils/fsdp_utils.py", line 396, in ensure_weights_retied
      [rank4]:     mod = model.get_submodule(name)
      [rank4]:           ^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank4]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/nn/modules/module.py", line 720, in get_submodule
      [rank4]:     raise AttributeError(
      [rank4]: AttributeError: GPTDolomiteForCausalLM has no attribute `lm_head`
      <snip>
      E0512 14:48:03.894000 646 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 650) of binary: /opt/app-root/bin/python3.11
      Traceback (most recent call last):
        File "/opt/app-root/bin/torchrun", line 8, in <module>
          sys.exit(main())
                   ^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
          return f(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/run.py", line 918, in main
          run(args)
        File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/run.py", line 909, in run
          elastic_launch(
        File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
          return launch_agent(self._config, self._entrypoint, list(args))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
          raise ChildFailedError(
      torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
      ============================================================
      /opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py FAILED
      ------------------------------------------------------------
      Failures:
      [1]:
        time      : 2025-05-12_14:48:03
        host      : fzatlouk-rhelai-1.4-amd-test-westus
        rank      : 3 (local_rank: 3)
        exitcode  : 1 (pid: 651)
        error_file: <N/A>
        traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
      ------------------------------------------------------------
      Root Cause (first observed failure):
      [0]:
        time      : 2025-05-12_14:48:03
        host      : fzatlouk-rhelai-1.4-amd-test-westus
        rank      : 2 (local_rank: 2)
        exitcode  : 1 (pid: 650)
        error_file: <N/A>
        traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
      ============================================================
      Training subprocess has not exited yet. Sending SIGTERM.
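
      For context, the failure happens in accelerate's FSDP preparation path: ensure_weights_retied resolves each tied-weight module by name with torch.nn.Module.get_submodule, and GPTDolomiteForCausalLM apparently does not register a submodule named lm_head (its output projection is presumably tied to the embedding rather than being a separate module). Below is a minimal sketch with a toy module (not the actual GPTDolomite or accelerate code) showing how get_submodule produces exactly this AttributeError:

      import torch.nn as nn

      class ToyCausalLM(nn.Module):
          """Toy stand-in for a model with no separate `lm_head` submodule."""
          def __init__(self):
              super().__init__()
              # Hypothetical layout: the embedding doubles as the tied output
              # projection, so nothing is registered under the name `lm_head`.
              self.wte = nn.Embedding(32, 8)

      model = ToyCausalLM()
      print(model.get_submodule("wte"))    # works: the named submodule exists

      try:
          # The same lookup accelerate performs for tied weights in ensure_weights_retied.
          model.get_submodule("lm_head")
      except AttributeError as exc:
          print(exc)                       # ToyCausalLM has no attribute `lm_head`

      The module and field names above are illustrative only; the traceback suggests a mismatch between the tied-weight names accelerate looks up and the submodules GPTDolomiteForCausalLM actually registers.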
      

       

              Assignee: Prarit Bhargava (prarit@redhat.com)
              Reporter: František Zatloukal (fzatlouk@redhat.com)