- Bug
- Resolution: Done
- Undefined
- None
- rhelai-1.4.1
- None
- False
- False
- Sprint 1
RHEL AI 1.4.1: Multi-phase training with Knowledge (7 epochs) and Skills (10 epochs) fails with a distributed training error
To Reproduce
Steps to reproduce the behavior:
- Bring up an AWS EC2 p5.48xlarge instance with the RHEL AI 1.4.1 image.
- Follow the instructions in the snippet at https://gitlab.cee.redhat.com/-/snippets/9540 up to, but not including, the training command on line 39 of the snippet.
- When you reach the multi-phase training step, run this command instead:
time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_*.jsonl --phased-phase1-num-epochs 7 --phased-phase2-num-epochs 10 | tee iso-testrun/ilab-train
- Observe that at the end of epoch 6 in phase 1 there is a transient error that can interrupt the multi-phase training.
- I verified the training log at `.local/share/instructlab/phased/phase1/checkpoints/full_logs_global0.log`, but it contains the same output as STDOUT (excerpt below; a checkpoint-verification sketch follows the excerpt).
/opt/app-root/lib64/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict . Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[19:46:58] INFO Saving model to /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/pytorch_model_fsdp.bin (fsdp_utils.py:88)
[19:47:10] INFO Model saved to /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/pytorch_model_fsdp.bin (fsdp_utils.py:90)
INFO FSDP Model saved to output dir /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6 (accelerator.py:3042)
INFO Saving FSDP Optimizer (accelerator.py:3059)
[19:47:19] INFO Saving Optimizer state to /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/optimizer.bin (fsdp_utils.py:192)
[19:47:42] INFO Optimizer state saved in /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/optimizer.bin (fsdp_utils.py:194)
INFO FSDP Optimizer saved to output dir /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6 (accelerator.py:3061)
INFO Scheduler state saved in /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/scheduler.bin (checkpointing.py:118)
INFO Sampler state for dataloader 0 saved in /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/sampler.bin (checkpointing.py:135)
INFO Random states saved in /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/random_states_0.pkl (checkpointing.py:160)
Saving training state: {'current_epoch': 6, 'samples_seen': 35436}
Model state saved in: /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6
Epoch 6: 100%|██████████| 40/40 [02:41<00:00, 4.03s/it]
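Since the failure occurs immediately after the epoch 6 checkpoint is reported as saved, it can help to confirm whether the full_state checkpoint was written out completely before the training subprocess was torn down. The following is a minimal sketch, not part of the reported workflow: the directory and file names are taken from the log excerpt above, and the default ~/.local/share/instructlab layout is assumed.

#!/usr/bin/env python3
"""Minimal sketch: check whether the phase-1 full_state checkpoint for epoch 6
was written completely. Paths and file names come from the log excerpt above;
adjust them for your environment."""

from pathlib import Path

# Checkpoint directory reported in the training log (assumes the default layout).
CKPT_DIR = Path.home() / ".local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6"

# Files the checkpointing code reports saving in the log above.
EXPECTED = [
    "pytorch_model_fsdp.bin",
    "optimizer.bin",
    "scheduler.bin",
    "sampler.bin",
    "random_states_0.pkl",
]

def main() -> None:
    missing = [name for name in EXPECTED if not (CKPT_DIR / name).exists()]
    if missing:
        print(f"Checkpoint at {CKPT_DIR} is incomplete, missing: {missing}")
        return
    print(f"Checkpoint at {CKPT_DIR} looks complete:")
    for name in EXPECTED:
        size_gb = (CKPT_DIR / name).stat().st_size / 1e9
        print(f"  {name}: {size_gb:.2f} GB")

if __name__ == "__main__":
    main()

If all five files are present with plausible sizes, the epoch 6 state was most likely saved and the failure happened during or after subprocess shutdown rather than in the middle of the checkpoint write.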
Expected behavior
- I would expect the multi-phase training to complete successfully, and it does sometimes succeed under the same conditions, ending like this:
Saving training state: {'current_epoch': 6, 'samples_seen': 35426}
Model state saved in: /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6
Epoch 6: 100%|██████████| 40/40 [02:32<00:00, 3.81s/it]
Operation completed successfully! 🎉
Waiting for process to exit, 60s...
Actual Result:
multiphase_training_RHELAI_141.log
Saving training state: {'current_epoch': 6, 'samples_seen': 35436}
Model state saved in: /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6
Epoch 6: 100%|██████████| 40/40 [02:41<00:00, 4.03s/it]
Training subprocess has not exited yet. Sending SIGTERM.
Waiting for process to exit, 60s...
--- Logging error ---
Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 261, in _run_phase
    _training_phase(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 563, in _training_phase
    run_training(train_args=train_args, torch_args=torch_args)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
    return run_training(torch_args=torch_args, train_args=train_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 837, in run_training
    raise RuntimeError(
RuntimeError: Suffered a failure during distributed training. Please see the training logs for more context.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.11/logging/__init__.py", line 1110, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/logging/__init__.py", line 953, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/log.py", line 19, in format
    return super().format(record)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/logging/__init__.py", line 687, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/logging/__init__.py", line 377, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
TypeError: not all arguments converted during string formatting
Call stack:
  File "/opt/app-root/bin/ilab", line 8, in <module>
    sys.exit(ilab())
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py", line 356, in wrapper
    return f(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/cli/model/train.py", line 469, in train
    accelerated_train.accelerated_train(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 202, in accelerated_train
    _run_phased_training(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 342, in _run_phased_training
    _run_phase(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 276, in _run_phase
    logger.error("Failed during training loop: ", e)
Message: 'Failed during training loop: '
Arguments: (RuntimeError('Suffered a failure during distributed training. Please see the training logs for more context.'),)
Accelerated Training failed with 1
Device Info (please complete the following information):
- Hardware Specs: AWS EC2 p5.48xlarge
- OS Version: Red Hat Enterprise Linux 9.20250213.0.4 (Plow)
- InstructLab Version: 0.23.2
- Output of the two requested commands:
- "registry.redhat.io/rhelai1/bootc-aws-nvidia-rhel9:1.4"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 4: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 5: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 6: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Platform:
  sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.50.2.el9_4.x86_64
  platform.machine: x86_64
  platform.node: ip-172-31-0-67.us-east-2.compute.internal
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1999.96 GB
  memory.available: 1938.05 GB
  memory.used: 36.38 GB
InstructLab:
  instructlab.version: 0.23.2
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.7.1
  instructlab-training.version: 0.7.0
Torch:
  torch.version: 2.5.1
  torch.backends.cpu.capability: AVX2
  torch.version.cuda: 12.4
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA H100 80GB HBM3
  torch.cuda.0.free: 5.2 GB
  torch.cuda.0.total: 79.1 GB
  torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: NVIDIA H100 80GB HBM3
  torch.cuda.1.free: 5.1 GB
  torch.cuda.1.total: 79.1 GB
  torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: NVIDIA H100 80GB HBM3
  torch.cuda.2.free: 4.7 GB
  torch.cuda.2.total: 79.1 GB
  torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: NVIDIA H100 80GB HBM3
  torch.cuda.3.free: 4.9 GB
  torch.cuda.3.total: 79.1 GB
  torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.4.name: NVIDIA H100 80GB HBM3
  torch.cuda.4.free: 5.1 GB
  torch.cuda.4.total: 79.1 GB
  torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.5.name: NVIDIA H100 80GB HBM3
  torch.cuda.5.free: 4.9 GB
  torch.cuda.5.total: 79.1 GB
  torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.6.name: NVIDIA H100 80GB HBM3
  torch.cuda.6.free: 4.9 GB
  torch.cuda.6.total: 79.1 GB
  torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.7.name: NVIDIA H100 80GB HBM3
  torch.cuda.7.free: 5.1 GB
  torch.cuda.7.total: 79.1 GB
  torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
  llama_cpp_python.version: 0.3.2
  llama_cpp_python.supports_gpu_offload: True
Bug impact
- It interrupts phase 1 training progress and forces the user to repeat the multi-phase training, which reduces effective GPU utilization and increases overall cost.
Known workaround
- Re-run the workflow; no changes are needed (a retry sketch is shown below).
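For completeness, a minimal sketch of the workaround follows: it re-runs the same training command from the reproduce steps until it exits 0. The Python wrapper, retry count, and exit handling are arbitrary choices for illustration, not part of the product.

#!/usr/bin/env python3
"""Sketch of the 'just re-run it' workaround: retry the unchanged multi-phase
training command until it succeeds."""

import subprocess
import sys

# Same command as in the reproduce steps, written with $(...) instead of backticks.
CMD = (
    "ilab model train -y --force-clear-phased-cache --enable-serving-output "
    "--strategy lab-multiphase "
    "--phased-phase1-data ~/.local/share/instructlab/datasets/"
    "$(ls -1 ~/.local/share/instructlab/datasets/ | head -n1)/knowledge_train_msgs_*.jsonl "
    "--phased-phase2-data ~/.local/share/instructlab/datasets/"
    "$(ls -1 ~/.local/share/instructlab/datasets/ | head -n1)/skills_train_msgs_*.jsonl "
    "--phased-phase1-num-epochs 7 --phased-phase2-num-epochs 10"
)
MAX_ATTEMPTS = 3  # arbitrary retry budget

for attempt in range(1, MAX_ATTEMPTS + 1):
    print(f"Starting multi-phase training, attempt {attempt}/{MAX_ATTEMPTS}")
    # bash is needed for the $(...) and ~ expansions in CMD.
    result = subprocess.run(CMD, shell=True, executable="/bin/bash")
    if result.returncode == 0:
        print("Multi-phase training completed successfully.")
        sys.exit(0)
    print(f"Training exited with code {result.returncode}; re-running unchanged.")

sys.exit(1)

Note that the command keeps --force-clear-phased-cache exactly as in the reproduce steps, so each attempt is assumed to start the phased run from scratch.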
Additional context
- This was also observed on an AMD machine by fzatlouk@redhat.com; the workaround was the same, re-run without any changes.
- …
- relates to:
  - RHELAI-3879: ilab train AssertionError: MultiQueryAttention should have 1 head for keys and values (Closed)