- Bug
- Resolution: Done
- Undefined
- None
- rhelai-1.4.1
- None
- False
- False
- Sprint 1
RHEL AI 1.4.1: Multi-phase training with Knowledge (7 epochs) and Skills (10 epochs) fails with a distributed training error
To Reproduce
Steps to reproduce the behavior:
- Bring up an AWS EC2 p5.48xlarge instance with the RHEL AI 1.4.1 image.
- Follow the instructions in the snippet at https://gitlab.cee.redhat.com/-/snippets/9540 up to, but not including, the training command on line 39 of the snippet.
- When you reach the multi-phase training step, run this command instead:
time ilab model train -y --force-clear-phased-cache --enable-serving-output --strategy lab-multiphase --phased-phase1-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/knowledge_train_msgs_*.jsonl --phased-phase2-data ~/.local/share/instructlab/datasets/`ls -1 ~/.local/share/instructlab/datasets/ | head -n1`/skills_train_msgs_*.jsonl --phased-phase1-num-epochs 7 --phased-phase2-num-epochs 10 | tee iso-testrun/ilab-train
- Observe that at the end of epoch 6 in phase 1 there is a transient error that can interrupt the multi-phase training.
- I verified the training log at `.local/share/instructlab/phased/phase1/checkpoints/full_logs_global0.log`, but it contains the same output as STDOUT (excerpt below; a checkpoint-verification sketch follows the excerpt).
/opt/app-root/lib64/python3.11/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:690: FutureWarning: FSDP.state_dict_type() and FSDP.set_state_dict_type() are being deprecated. Please use APIs, get_state_dict() and set_state_dict(), which can support different parallelisms, FSDP1, FSDP2, DDP. API doc: https://pytorch.org/docs/stable/distributed.checkpoint.html#torch.distributed.checkpoint.state_dict.get_state_dict . Tutorial: https://pytorch.org/tutorials/recipes/distributed_checkpoint_recipe.html .
  warnings.warn(
[19:46:58] INFO Saving model to /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/pytorch_model_fsdp.bin (fsdp_utils.py:88)
[19:47:10] INFO Model saved to /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/pytorch_model_fsdp.bin (fsdp_utils.py:90)
INFO FSDP Model saved to output dir /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6 (accelerator.py:3042)
INFO Saving FSDP Optimizer (accelerator.py:3059)
[19:47:19] INFO Saving Optimizer state to /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/optimizer.bin (fsdp_utils.py:192)
[19:47:42] INFO Optimizer state saved in /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/optimizer.bin (fsdp_utils.py:194)
INFO FSDP Optimizer saved to output dir /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6 (accelerator.py:3061)
INFO Scheduler state saved in /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/scheduler.bin (checkpointing.py:118)
INFO Sampler state for dataloader 0 saved in /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/sampler.bin (checkpointing.py:135)
INFO Random states saved in /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6/random_states_0.pkl (checkpointing.py:160)
Saving training state: {'current_epoch': 6, 'samples_seen': 35436}
Model state saved in: /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6
Epoch 6: 100%|██████████| 40/40 [02:41<00:00, 4.03s/it]
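Since the failure occurs immediately after the epoch 6 checkpoint is reported as saved, it can help to confirm whether the full_state checkpoint was written out completely before the training subprocess was torn down. The following is a minimal sketch, not part of the reported workflow: the directory and file names are taken from the log excerpt above, and the default ~/.local/share/instructlab layout is assumed.

#!/usr/bin/env python3
"""Minimal sketch: check whether the phase-1 full_state checkpoint for epoch 6
was written completely. Paths and file names come from the log excerpt above;
adjust them for your environment."""

from pathlib import Path

# Checkpoint directory reported in the training log (assumes the default layout).
CKPT_DIR = Path.home() / ".local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6"

# Files the checkpointing code reports saving in the log above.
EXPECTED = [
    "pytorch_model_fsdp.bin",
    "optimizer.bin",
    "scheduler.bin",
    "sampler.bin",
    "random_states_0.pkl",
]

def main() -> None:
    missing = [name for name in EXPECTED if not (CKPT_DIR / name).exists()]
    if missing:
        print(f"Checkpoint at {CKPT_DIR} is incomplete, missing: {missing}")
        return
    print(f"Checkpoint at {CKPT_DIR} looks complete:")
    for name in EXPECTED:
        size_gb = (CKPT_DIR / name).stat().st_size / 1e9
        print(f"  {name}: {size_gb:.2f} GB")

if __name__ == "__main__":
    main()

If all five files are present with plausible sizes, the epoch 6 state was most likely saved and the failure happened during or after subprocess shutdown rather than in the middle of the checkpoint write.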
Expected behavior
- I would expect the multi-phase training to complete successfully, and it does sometimes succeed under the same conditions, ending like this:
Saving training state: {'current_epoch': 6, 'samples_seen': 35426}
Model state saved in: /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6
Epoch 6: 100%|██████████| 40/40 [02:32<00:00, 3.81s/it]
Operation completed successfully! 🎉
Waiting for process to exit, 60s...
Actual Result:
multiphase_training_RHELAI_141.log
Saving training state: {'current_epoch': 6, 'samples_seen': 35436}
Model state saved in: /var/home/cloud-user/.local/share/instructlab/phased/phase1/checkpoints/full_state/epoch_6
Epoch 6: 100%|██████████| 40/40 [02:41<00:00, 4.03s/it]
Training subprocess has not exited yet. Sending SIGTERM.
Waiting for process to exit, 60s...
--- Logging error ---
Traceback (most recent call last):
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 261, in _run_phase
    _training_phase(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 563, in _training_phase
    run_training(train_args=train_args, torch_args=torch_args)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/__init__.py", line 36, in run_training
    return run_training(torch_args=torch_args, train_args=train_args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/training/main_ds.py", line 837, in run_training
    raise RuntimeError(
RuntimeError: Suffered a failure during distributed training. Please see the training logs for more context.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib64/python3.11/logging/__init__.py", line 1110, in emit
    msg = self.format(record)
          ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/logging/__init__.py", line 953, in format
    return fmt.format(record)
           ^^^^^^^^^^^^^^^^^^
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/log.py", line 19, in format
    return super().format(record)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/logging/__init__.py", line 687, in format
    record.message = record.getMessage()
                     ^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/logging/__init__.py", line 377, in getMessage
    msg = msg % self.args
          ~~~~^~~~~~~~~~~
TypeError: not all arguments converted during string formatting
Call stack:
  File "/opt/app-root/bin/ilab", line 8, in <module>
    sys.exit(ilab())
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1082, in main
    rv = self.invoke(ctx)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1697, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/app-root/lib64/python3.11/site-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/clickext.py", line 356, in wrapper
    return f(*args, **kwargs)
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/cli/model/train.py", line 469, in train
    accelerated_train.accelerated_train(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 202, in accelerated_train
    _run_phased_training(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 342, in _run_phased_training
    _run_phase(
  File "/opt/app-root/lib64/python3.11/site-packages/instructlab/model/accelerated_train.py", line 276, in _run_phase
    logger.error("Failed during training loop: ", e)
Message: 'Failed during training loop: '
Arguments: (RuntimeError('Suffered a failure during distributed training. Please see the training logs for more context.'),)
Accelerated Training failed with 1
Device Info (please complete the following information):
- Hardware Specs: AWS EC2 p5.48xlarge
- OS Version: Red Hat Enterprise Linux 9.20250213.0.4 (Plow)
- InstructLab Version: 0.23.2
- Output of the two requested commands:
- "registry.redhat.io/rhelai1/bootc-aws-nvidia-rhel9:1.4"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 1: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 2: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 3: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 4: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 5: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 6: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
  Device 7: NVIDIA H100 80GB HBM3, compute capability 9.0, VMM: yes
Platform:
  sys.version: 3.11.7 (main, Jan 8 2025, 00:00:00) [GCC 11.4.1 20231218 (Red Hat 11.4.1-3)]
  sys.platform: linux
  os.name: posix
  platform.release: 5.14.0-427.50.2.el9_4.x86_64
  platform.machine: x86_64
  platform.node: ip-172-31-0-67.us-east-2.compute.internal
  platform.python_version: 3.11.7
  os-release.ID: rhel
  os-release.VERSION_ID: 9.4
  os-release.PRETTY_NAME: Red Hat Enterprise Linux 9.4 (Plow)
  memory.total: 1999.96 GB
  memory.available: 1938.05 GB
  memory.used: 36.38 GB
InstructLab:
  instructlab.version: 0.23.2
  instructlab-dolomite.version: 0.2.0
  instructlab-eval.version: 0.5.1
  instructlab-quantize.version: 0.1.0
  instructlab-schema.version: 0.4.2
  instructlab-sdg.version: 0.7.1
  instructlab-training.version: 0.7.0
Torch:
  torch.version: 2.5.1
  torch.backends.cpu.capability: AVX2
  torch.version.cuda: 12.4
  torch.version.hip: None
  torch.cuda.available: True
  torch.backends.cuda.is_built: True
  torch.backends.mps.is_built: False
  torch.backends.mps.is_available: False
  torch.cuda.bf16: True
  torch.cuda.current.device: 0
  torch.cuda.0.name: NVIDIA H100 80GB HBM3
  torch.cuda.0.free: 5.2 GB
  torch.cuda.0.total: 79.1 GB
  torch.cuda.0.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.1.name: NVIDIA H100 80GB HBM3
  torch.cuda.1.free: 5.1 GB
  torch.cuda.1.total: 79.1 GB
  torch.cuda.1.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.2.name: NVIDIA H100 80GB HBM3
  torch.cuda.2.free: 4.7 GB
  torch.cuda.2.total: 79.1 GB
  torch.cuda.2.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.3.name: NVIDIA H100 80GB HBM3
  torch.cuda.3.free: 4.9 GB
  torch.cuda.3.total: 79.1 GB
  torch.cuda.3.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.4.name: NVIDIA H100 80GB HBM3
  torch.cuda.4.free: 5.1 GB
  torch.cuda.4.total: 79.1 GB
  torch.cuda.4.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.5.name: NVIDIA H100 80GB HBM3
  torch.cuda.5.free: 4.9 GB
  torch.cuda.5.total: 79.1 GB
  torch.cuda.5.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.6.name: NVIDIA H100 80GB HBM3
  torch.cuda.6.free: 4.9 GB
  torch.cuda.6.total: 79.1 GB
  torch.cuda.6.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
  torch.cuda.7.name: NVIDIA H100 80GB HBM3
  torch.cuda.7.free: 5.1 GB
  torch.cuda.7.total: 79.1 GB
  torch.cuda.7.capability: 9.0 (see https://developer.nvidia.com/cuda-gpus#compute)
llama_cpp_python:
  llama_cpp_python.version: 0.3.2
  llama_cpp_python.supports_gpu_offload: True
Bug impact
- It interrupts phase 1 training progress and forces the user to repeat the multi-phase training, which reduces effective GPU utilization and increases overall cost.
Known workaround
- Re-run the workflow; no changes are needed (a retry sketch is shown below).
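For completeness, a minimal sketch of the workaround follows: it re-runs the same training command from the reproduce steps until it exits 0. The Python wrapper, retry count, and exit handling are arbitrary choices for illustration, not part of the product.

#!/usr/bin/env python3
"""Sketch of the 'just re-run it' workaround: retry the unchanged multi-phase
training command until it succeeds."""

import subprocess
import sys

# Same command as in the reproduce steps, written with $(...) instead of backticks.
CMD = (
    "ilab model train -y --force-clear-phased-cache --enable-serving-output "
    "--strategy lab-multiphase "
    "--phased-phase1-data ~/.local/share/instructlab/datasets/"
    "$(ls -1 ~/.local/share/instructlab/datasets/ | head -n1)/knowledge_train_msgs_*.jsonl "
    "--phased-phase2-data ~/.local/share/instructlab/datasets/"
    "$(ls -1 ~/.local/share/instructlab/datasets/ | head -n1)/skills_train_msgs_*.jsonl "
    "--phased-phase1-num-epochs 7 --phased-phase2-num-epochs 10"
)
MAX_ATTEMPTS = 3  # arbitrary retry budget

for attempt in range(1, MAX_ATTEMPTS + 1):
    print(f"Starting multi-phase training, attempt {attempt}/{MAX_ATTEMPTS}")
    # bash is needed for the $(...) and ~ expansions in CMD.
    result = subprocess.run(CMD, shell=True, executable="/bin/bash")
    if result.returncode == 0:
        print("Multi-phase training completed successfully.")
        sys.exit(0)
    print(f"Training exited with code {result.returncode}; re-running unchanged.")

sys.exit(1)

Note that the command keeps --force-clear-phased-cache exactly as in the reproduce steps, so each attempt is assumed to start the phased run from scratch.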
Additional context
- This was also observed on an AMD machine by fzatlouk@redhat.com; the workaround was the same, re-run without any changes.
- …
- relates to:
  - RHELAI-3879: ilab train AssertionError: MultiQueryAttention should have 1 head for keys and values (Closed)