  Red Hat Enterprise Linux AI
  RHELAI-3623

RHEL AI 1.4.1: Multi-phase training fails with a distributed training error


    • Type: Bug
    • Resolution: Done
    • Version: rhelai-1.4.1
    • Component: InstructLab - Training
    • Sprint: Sprint 1

      RHEL AI 1.4.1: Multi-phase training with Knowledge (7 epochs) and Skills (10 epochs) fails with a distributed training error

      To Reproduce

      Steps to reproduce the behavior:

      1. Bring up an AWS EC2 p5.48xlarge instance with the RHEL AI 1.4.1 image.
      2. Follow the instructions in this snippet up to, but not including, the training command on line 39: https://gitlab.cee.redhat.com/-/snippets/9540
      3. Run this multi-phase training command instead of the one in the snippet (a path-resolution sketch follows the log excerpt below):
      time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_*.jsonl --phased-phase1-num-epochs 7 --phased-phase2-num-epochs 10 | tee iso-testrun/ilab-train
      4. Observe that at the end of epoch 6 in phase 1, a transient error interrupts the multi-phase training.
      5. Check the training log at ~/.local/share/instructlab/phased/phase1/checkpoints/full_logs_global0.log; it contains the same output as STDOUT, shown below.
      /opt/app-root/lib64/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict .Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
        warnings.warn(
      [19:46:58] INFO     Saving model to                             fsdp_utils.py:88
                          /var/home/cloud-user/.local/share/instructl
                          ab/phased/phase1/checkpoints/full_state/epo
                          ch_6/pytorch_model_fsdp.bin
      [19:47:10] INFO     Model saved to                              fsdp_utils.py:90
                          /var/home/cloud-user/.local/share/instructl
                          ab/phased/phase1/checkpoints/full_state/epo
                          ch_6/pytorch_model_fsdp.bin
                 INFO     FSDP Model saved to output dir           accelerator.py:3042
                          /var/home/cloud-user/.local/share/instru
                          ctlab/phased/phase1/checkpoints/full_sta
                          te/epoch_6
                 INFO     Saving FSDP Optimizer                    accelerator.py:3059
      [19:47:19] INFO     Saving Optimizer state to                  fsdp_utils.py:192
                          /var/home/cloud-user/.local/share/instruct
                          lab/phased/phase1/checkpoints/full_state/e
                          poch_6/optimizer.bin
      [19:47:42] INFO     Optimizer state saved in                   fsdp_utils.py:194
                          /var/home/cloud-user/.local/share/instruct
                          lab/phased/phase1/checkpoints/full_state/e
                          poch_6/optimizer.bin
                 INFO     FSDP Optimizer saved to output dir       accelerator.py:3061
                          /var/home/cloud-user/.local/share/instru
                          ctlab/phased/phase1/checkpoints/full_sta
                          te/epoch_6
                 INFO     Scheduler state saved in                checkpointing.py:118
                          /var/home/cloud-user/.local/share/instr
                          uctlab/phased/phase1/checkpoints/full_s
                          tate/epoch_6/scheduler.bin
                 INFO     Sampler state for dataloader 0 saved in checkpointing.py:135
                          /var/home/cloud-user/.local/share/instr
                          uctlab/phased/phase1/checkpoints/full_s
                          tate/epoch_6/sampler.bin
                 INFO     Random states saved in                  checkpointing.py:160
                          /var/home/cloud-user/.local/share/instr
                          uctlab/phased/phase1/checkpoints/full_s
                          tate/epoch_6/random_states_0.pkl
      Saving training state: {'current_epoch': 6, 'samples_seen': 35436}
      Model state saved in: /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6
      Epoch 6: 100%|██████████| 40/40 [02:41<00:00,  4.03s/it]
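      The backtick substitutions in the training command above pick the first directory under ~/.local/share/instructlab/datasets and glob the knowledge and skills JSONL files inside it. The following is a minimal Python sketch of that same path resolution, assuming the default InstructLab data layout from the steps above; the phased_data_paths helper and the printed flag names are illustrative only.

      from pathlib import Path

      DATASETS = Path.home() / ".local/share/instructlab/datasets"

      def phased_data_paths() -> tuple[Path, Path]:
          """Resolve the phase-1 (knowledge) and phase-2 (skills) JSONL files the
          same way the backtick expansion in the repro command does: take the
          first dataset directory and glob inside it."""
          # `ls -1 | head -n1` equivalent: first entry in lexicographic order.
          first_run = sorted(p for p in DATASETS.iterdir() if p.is_dir())[0]
          knowledge = next(first_run.glob("knowledge_train_msgs_*.jsonl"))
          skills = next(first_run.glob("skills_train_msgs_*.jsonl"))
          return knowledge, skills

      if __name__ == "__main__":
          k, s = phased_data_paths()
          print(f"--phased-phase1-data {k}")
          print(f"--phased-phase2-data {s}")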

      Expected behavior

      • The multi-phase training is expected to complete successfully, and it sometimes does under the same conditions, ending with:
      Saving training state: {'current_epoch': 6, 'samples_seen': 35426}
      Model state saved in: /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6
      Epoch 6: 100%|██████████| 40/40 [02:32<00:00,  3.81s/it]
      Operation completed successfully! 🎉
      Waiting for process to exit, 60s... 

       

      Actual Result:

      Full log attached: multiphase_training_RHELAI_141.log

      Saving training state: {'current_epoch': 6, 'samples_seen': 35436}
      Model state saved in: /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6
      Epoch 6: 100%|██████████| 40/40 [02:41<00:00,  4.03s/it]
      Training subprocess has not exited yet. Sending SIGTERM.
      Waiting for process to exit, 60s...
      --- Logging error ---
      Traceback (most recent call last):
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 261, in _run_phase
          _training_phase(
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 563, in _training_phase
          run_training(train_args=train_args, torch_args=torch_args)
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
          return run_training(torch_args=torch_args, train_args=train_args)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 837, in run_training
          raise RuntimeError(
      RuntimeError: Suffered a failure during distributed training. Please see the training logs for more context.
      
      During handling of the above exception, another exception occurred:
      
      Traceback (most recent call last):
        File "/usr/lib64/python3.11/logging/__init__.py", line 1110, in emit
          msg = self.format(record)
                ^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/logging/__init__.py", line 953, in format
          return fmt.format(record)
                 ^^^^^^^^^^^^^^^^^^
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/log.py", line 19, in format
          return super().format(record)
                 ^^^^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/logging/__init__.py", line 687, in format
          record.message = record.getMessage()
                           ^^^^^^^^^^^^^^^^^^^
        File "/usr/lib64/python3.11/logging/__init__.py", line 377, in getMessage
          msg = msg % self.args
                ~~~~^~~~~~~~~~~
      TypeError: not all arguments converted during string formatting
      Call stack:
        File "/opt/app-root/bin/ilab", line 8, in <module>
          sys.exit(ilab())
        File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1161, in __call__
          return self.main(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1082, in main
          rv = self.invoke(ctx)
        File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
          return _process_result(sub_ctx.command.invoke(sub_ctx))
        File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
          return _process_result(sub_ctx.command.invoke(sub_ctx))
        File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1443, in invoke
          return ctx.invoke(self.callback, **ctx.params)
        File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 788, in invoke
          return __callback(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
          return f(get_current_context(), *args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py", line 356, in wrapper
          return f(*args, **kwargs)
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/cli/model/train.py", line 469, in train
          accelerated_train.accelerated_train(
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 202, in accelerated_train
          _run_phased_training(
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 342, in _run_phased_training
          _run_phase(
        File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 276, in _run_phase
          logger.error("Failed during training loop: ", e)
      Message: 'Failed during training loop: '
      Arguments: (RuntimeError('Suffered a failure during distributed training. Please see the training logs for more context.'),)
      Accelerated Training failed with 1 
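      The secondary TypeError above is a separate logging bug: accelerated_train.py line 276 calls logger.error("Failed during training loop: ", e), passing the exception as a %-style argument for a message that has no placeholder, so the stdlib logging machinery fails while reporting the real RuntimeError. The following is a minimal standalone sketch that reproduces the "--- Logging error ---" output and shows the conventional fixes; it illustrates stdlib logging behavior and is not a patch to the InstructLab sources.

      import logging

      logging.basicConfig(level=logging.ERROR)
      logger = logging.getLogger("repro")

      err = RuntimeError(
          "Suffered a failure during distributed training. "
          "Please see the training logs for more context."
      )

      # Buggy pattern: an extra argument but no %s placeholder in the message.
      # The logging module prints "--- Logging error ---" followed by
      # "TypeError: not all arguments converted during string formatting".
      logger.error("Failed during training loop: ", err)

      # Conventional fixes: give the argument a placeholder, or log the
      # traceback with logger.exception() inside an except block.
      logger.error("Failed during training loop: %s", err)
      try:
          raise err
      except RuntimeError:
          logger.exception("Failed during training loop")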

      Device Info (please complete the following information):

      • Hardware Specs: AWS EC2 p5.48xlarge
      • OS Version: Red Hat Enterprise Linux 9.20250213.0.4 (Plow)
      • InstructLab Version: 0.23.2
      • Output of the two requested commands:
        • "registry.redhat.io/rhelai1/bootc-aws-nvidia-rhel9:1.4"
        • ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
          ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
          ggml_cuda_init: found 8 CUDA devices:
            Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
            Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
            Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
            Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
            Device 4: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
            Device 5: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
            Device 6: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
            Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
          Platform:
            sys.version: 3.11.7 (main, Jan  8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
            sys.platform: linux
            os.name: posix
            platform.release: 5.14.0-427.50.2.el9_4.x86_64
            platform.machine: x86_64
            platform.node: ip-172-31-0-67.us-east-2.compute.internal
            platform.python_version: 3.11.7
            os-release.ID: rhel
            os-release.VERSION_ID: 9.4
            os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
            memory.total: 1999.96 GB
            memory.available: 1938.05 GB
            memory.used: 36.38 GB
          InstructLab:
            instructlab.version: 0.23.2
            instructlab-dolomite.version: 0.2.0
            instructlab-eval.version: 0.5.1
            instructlab-quantize.version: 0.1.0
            instructlab-schema.version: 0.4.2
            instructlab-sdg.version: 0.7.1
            instructlab-training.version: 0.7.0
          Torch:
            torch.version: 2.5.1
            torch.backends.cpu.capability: AVX2
            torch.version.cuda: 12.4
            torch.version.hip: None
            torch.cuda.available: True
            torch.backends.cuda.is_built: True
            torch.backends.mps.is_built: False
            torch.backends.mps.is_available: False
            torch.cuda.bf16: True
            torch.cuda.current.device: 0
            torch.cuda.0.name: NVIDIA H100 80GB HBM3
            torch.cuda.0.free: 5.2 GB
            torch.cuda.0.total: 79.1 GB
            torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.1.name: NVIDIA H100 80GB HBM3
            torch.cuda.1.free: 5.1 GB
            torch.cuda.1.total: 79.1 GB
            torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.2.name: NVIDIA H100 80GB HBM3
            torch.cuda.2.free: 4.7 GB
            torch.cuda.2.total: 79.1 GB
            torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.3.name: NVIDIA H100 80GB HBM3
            torch.cuda.3.free: 4.9 GB
            torch.cuda.3.total: 79.1 GB
            torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.4.name: NVIDIA H100 80GB HBM3
            torch.cuda.4.free: 5.1 GB
            torch.cuda.4.total: 79.1 GB
            torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.5.name: NVIDIA H100 80GB HBM3
            torch.cuda.5.free: 4.9 GB
            torch.cuda.5.total: 79.1 GB
            torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.6.name: NVIDIA H100 80GB HBM3
            torch.cuda.6.free: 4.9 GB
            torch.cuda.6.total: 79.1 GB
            torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
            torch.cuda.7.name: NVIDIA H100 80GB HBM3
            torch.cuda.7.free: 5.1 GB
            torch.cuda.7.total: 79.1 GB
            torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
          llama_cpp_python:
            llama_cpp_python.version: 0.3.2
            llama_cpp_python.supports_gpu_offload: True 

      Bug impact

      • It interrupts phase 1 training progress; the user has to repeat the entire multi-phase training run, which lowers effective GPU utilization and raises overall cost.

      Known workaround

      • Re-run the workflow; no changes are needed. A retry sketch follows below.
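      Since the workaround is simply to re-run the same command, a small wrapper can automate the retry. This is a minimal sketch under the assumptions that the exact reproduction command above is used unchanged, that it runs from the cloud-user home directory so the backtick expansions resolve, and that three attempts are enough; the time/tee wrappers from the original command are omitted.

      import subprocess
      import sys

      # Literal reproduction command, re-run unchanged on failure.
      COMMAND = (
          "ilab model train -y --force-clear-phased-cache --enable-serving-output "
          "--strategy lab-multiphase "
          "--phased-phase1-data ~/.local/share/instructlab/datasets/"
          "`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl "
          "--phased-phase2-data ~/.local/share/instructlab/datasets/"
          "`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_*.jsonl "
          "--phased-phase1-num-epochs 7 --phased-phase2-num-epochs 10"
      )

      for attempt in range(1, 4):
          print(f"Multi-phase training, attempt {attempt}/3", file=sys.stderr)
          if subprocess.run(COMMAND, shell=True).returncode == 0:
              break
      else:
          sys.exit("Multi-phase training failed on every attempt")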

      Additional context

      • This was also observed on an AMD machine by fzatlouk@redhat.com; the workaround was the same: re-run without any changes.

              Assignee: rhn-support-jkunstle James Kunstle (Inactive)
              Reporter: kakella@redhat.com Kamesh Akella