Task
Resolution: Unresolved
[3053343113] Upstream Reporter: James Kunstle
Upstream issue status: Closed
Upstream description:
Issue happened while running on 1xH200. FSDP defaulted to NO_SHARD since world_size=1.
Epoch 0:   0%| | 68/133000 [00:32<15:20:43,  2.41it/s]
{ "epoch": 0, "step": 68, "rank": 0, "overall_throughput": 5.975545350090432, "lr": 2e-05, "cuda_mem_allocated": 61.14389753341675, "max_cuda_mem_allocated": 63.741647720336914, "cuda_malloc_retries": 0, "num_loss_counted_tokens": 953, "batch_size": 2, "total_loss": 0.7966516740925039, "samples_seen": 171, "gradnorm": null, "total_samples": 311150, "timestamp": "2025-05-09T07:23:20.699976" }
Epoch: 0, Step: 69, Rank: 0, loss = nan
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ec2-user/jkunstle/training/src/instructlab/training/main_ds.py", line 1047, in <module>
[rank0]:     main(args)
[rank0]:   File "/home/ec2-user/jkunstle/training/src/instructlab/training/main_ds.py", line 692, in main
[rank0]:     train(
[rank0]:   File "/home/ec2-user/jkunstle/training/src/instructlab/training/main_ds.py", line 484, in train
[rank0]:     "total_loss": float(log_loss / num_loss_counted_tokens),
[rank0]:                         ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
[rank0]: ZeroDivisionError: float division by zero
wandb:
wandb: ? View run happy-eon-77 at: https://wandb.ai/jkunstle-test/h200-config-building/runs/un7km6qo
wandb: Find logs at: wandb/run-20250509_072232-un7km6qo/logs
[rank0]:[W509 07:23:23.390667148 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0509 07:23:25.395000 152065 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 152132) of binary: /home/ec2-user/jkunstle/training/venv/bin/python3.11
Traceback (most recent call last):
  File "/home/ec2-user/jkunstle/training/venv/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/run.py", line 892, in main
    run(args)
  File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/run.py", line 883, in run
    elastic_launch(
  File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/ec2-user/jkunstle/training/src/instructlab/training/main_ds.py FAILED
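Editor's note: the traceback shows train() dividing log_loss by num_loss_counted_tokens with no zero check, so a batch that contributes no counted loss tokens (here coinciding with the NaN loss at step 69) crashes the run. Below is a minimal defensive sketch of that metrics computation; it is not the project's actual patch, and only the names log_loss and num_loss_counted_tokens are taken from the traceback above.

# Hypothetical guard for the metrics computation; not the actual fix in
# instructlab/training, just an illustration of avoiding the crash.
def safe_total_loss(log_loss: float, num_loss_counted_tokens: int) -> float:
    """Return the per-token loss, or NaN when no loss tokens were counted."""
    if num_loss_counted_tokens == 0:
        # Surface the degenerate batch instead of raising ZeroDivisionError.
        return float("nan")
    return float(log_loss / num_loss_counted_tokens)

# Example usage:
print(safe_total_loss(0.79, 953))  # normal batch -> small positive value
print(safe_total_loss(0.0, 0))     # degenerate batch -> nan instead of a crash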
Upstream URL: https://github.com/instructlab/training/issues/547
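Editor's note: the reporter mentions that FSDP fell back to NO_SHARD because world_size=1. For context, here is a minimal sketch of how such a fallback can be expressed with the stock torch.distributed.fsdp API; the helper name pick_sharding_strategy is an illustration only, not code from the training repo.

# Assumed illustration: choose an FSDP sharding strategy based on world size.
import torch.distributed as dist
from torch.distributed.fsdp import ShardingStrategy

def pick_sharding_strategy() -> ShardingStrategy:
    # With a single rank there is nothing to shard across, so NO_SHARD
    # (essentially local, unsharded training) is the only meaningful choice.
    world_size = dist.get_world_size() if dist.is_initialized() else 1
    if world_size == 1:
        return ShardingStrategy.NO_SHARD
    return ShardingStrategy.FULL_SHARD

# The result would typically be passed to the wrapper, e.g.
# FSDP(model, sharding_strategy=pick_sharding_strategy()).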