RHELAI-4137: Training on AMD fails with OOM (HIP out of memory)

Priority: Critical

      An attempt to run training on AMD with RHEL AI 1.5-6 fails even with the fix from https://issues.redhat.com/browse/RHELAI-4128 applied. Testing was done on Azure, on an MI300X instance.

      The executed command was: 

      time ilab model train -y --force-clear-phased-cache --enable-serving-output \
          --strategy lab-multiphase \
          --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl \
          --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 .local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_reduced.jsonl \
          --phased-phase1-num-epochs 2 --phased-phase2-num-epochs 2 \
          | tee iso-testrun/ilab-train
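
      To give the report per-GPU memory telemetry on a rerun, a simple sampling loop can be started in a second shell before launching the command above. This is only a sketch: it assumes rocm-smi is available on the host, and iso-testrun/rocm-smi-mem.log is a hypothetical log name.

      # Hypothetical helper (not part of the reproduction): sample VRAM usage on all
      # GPUs every 10 seconds so the growth leading up to the OOM is captured.
      while true; do
          date --iso-8601=seconds
          rocm-smi --showmeminfo vram
          sleep 10
      done | tee iso-testrun/rocm-smi-mem.log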

      The relevant snippet is:

      Epoch: 0, Step: 8, Rank: 2, loss = 0.4657960534095764
      Epoch: 0, Step: 8, Rank: 1, loss = 0.524458646774292
      Epoch: 0, Step: 8, Rank: 0, loss = 0.344452440738678
      Epoch: 0, Step: 8, Rank: 3, loss = 0.5397558808326721
      Epoch: 0, Step: 8, Rank: 4, loss = 0.7030256390571594
      Epoch: 0, Step: 8, Rank: 6, loss = 0.8349191546440125
      Epoch: 0, Step: 8, Rank: 5, loss = 0.8995961546897888
      Epoch: 0, Step: 8, Rank: 7, loss = 0.6992480158805847
      [rank2]: Traceback (most recent call last):
      [rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 1013, in <module>
      [rank2]:     main(args)
      [rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 665, in main
      [rank2]:     train(
      [rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 437, in train
      [rank2]:     accelerator.backward(loss)
      [rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 2454, in backward
      [rank2]:     loss.backward(**kwargs)
      [rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/_tensor.py", line 626, in backward
      [rank2]:     torch.autograd.backward(
      [rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
      [rank2]:     _engine_run_backward(
      [rank2]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
      [rank2]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
      [rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank2]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 24.93 GiB. GPU 2 has a total capacity of 191.45 GiB of which 21.64 GiB is free. Of the allocated memory 123.73 GiB is allocated by PyTorch, and 39.76 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
      [rank5]: Traceback (most recent call last):
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 1013, in <module>
      [rank5]:     main(args)
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 665, in main
      [rank5]:     train(
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 437, in train
      [rank5]:     accelerator.backward(loss)
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/accelerate/accelerator.py", line 2454, in backward
      [rank5]:     loss.backward(**kwargs)
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/_tensor.py", line 626, in backward
      [rank5]:     torch.autograd.backward(
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/autograd/__init__.py", line 347, in backward
      [rank5]:     _engine_run_backward(
      [rank5]:   File "/opt/app-root/lib64/python3.11/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
      [rank5]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
      [rank5]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      [rank5]: torch.OutOfMemoryError: HIP out of memory. Tried to allocate 24.93 GiB. GPU 5 has a total capacity of 191.45 GiB of which 21.77 GiB is free. Of the allocated memory 123.73 GiB is allocated by PyTorch, and 39.76 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
      W0513 13:41:06.536000 1963 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1965 closing signal SIGTERM
      W0513 13:41:06.538000 1963 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1966 closing signal SIGTERM
      W0513 13:41:06.538000 1963 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1968 closing signal SIGTERM
      W0513 13:41:06.540000 1963 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1969 closing signal SIGTERM
      W0513 13:41:06.541000 1963 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1971 closing signal SIGTERM
      W0513 13:41:06.542000 1963 torch/distributed/elastic/multiprocessing/api.py:897] Sending process 1972 closing signal SIGTERM
      E0513 13:41:07.924000 1963 torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 2 (pid: 1967) of binary: /opt/app-root/bin/python3.11
      Traceback (most recent call last):
        File "/opt/app-root/bin/torchrun", line 8, in <module>
          sys.exit(main())
                   ^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
          return f(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/run.py", line 918, in main
          run(args)
        File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/run.py", line 909, in run
          elastic_launch(
        File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
          return launch_agent(self._config, self._entrypoint, list(args))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
          raise ChildFailedError(
      torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
      ============================================================
      /opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py FAILED
      ------------------------------------------------------------
      Failures:
      [1]:
        time      : 2025-05-13_13:41:06
        host      : fzatlouk-rhelai-1.4-amd-test-westus
        rank      : 5 (local_rank: 5)
        exitcode  : 1 (pid: 1970)
        error_file: <N/A>
        traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
      ------------------------------------------------------------
      Root Cause (first observed failure):
      [0]:
        time      : 2025-05-13_13:41:06
        host      : fzatlouk-rhelai-1.4-amd-test-westus
        rank      : 2 (local_rank: 2)
        exitcode  : 1 (pid: 1967)
        error_file: <N/A>
        traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
      ============================================================
      Training subprocess has not exited yet. Sending SIGTERM.
      Waiting for process to exit, 60s...
      ERROR 2025-05-13 13:41:10,024 instructlab.model.accelerated_train:276: Failed during training loop: Suffered a failure during distributed training. Please see the training logs for more context.
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 261, in _run_phase
          _training_phase(
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 563, in _training_phase
          run_training(train_args=train_args, torch_args=torch_args)
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
          return run_training(torch_args=torch_args, train_args=train_args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 863, in run_training
          raise RuntimeError(
      RuntimeError: Suffered a failure during distributed training. Please see the training logs for more context.
      Accelerated Training failed with 1
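
      The error text itself points at allocator fragmentation: ranks 2 and 5 each tried to allocate 24.93 GiB with only ~21.7 GiB reported free, while 39.76 GiB was reserved by PyTorch but unallocated. A possible mitigation to test, taken directly from that message, is enabling expandable segments before rerunning. This is a sketch only; it assumes the exported variable is inherited by the torchrun worker processes that ilab launches.

      # Allocator hint taken verbatim from the PyTorch error message above.
      # Assumption: ilab's torchrun workers inherit this environment variable.
      export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
      # ...then rerun the same "ilab model train" command as above, unchanged.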

              xdong@redhat.com Xiyang Dong
              fzatlouk@redhat.com František Zatloukal