Red Hat Enterprise Linux AI / RHELAI-4124

[instructlab/training] `num_loss_counted_tokens` == 0 for a batch causes a division-by-zero error.

      Upstream Reporter: James Kunstle
      Upstream issue status: Closed
      Upstream description:

      The issue happened while running on 1xH200; FSDP defaulted to NO_SHARD since world_size=1. At step 69 the batch contained zero loss-counted tokens, so the total-loss metric computation divided by zero (see the traceback below and the guard sketch after it).

      Epoch 0:   0%|          | 68/133000 [00:32<15:20:43,  2.41it/s]{
          "epoch": 0,
          "step": 68,
          "rank": 0,
          "overall_throughput": 5.975545350090432,
          "lr": 2e-05,
          "cuda_mem_allocated": 61.14389753341675,
          "max_cuda_mem_allocated": 63.741647720336914,
          "cuda_malloc_retries": 0,
          "num_loss_counted_tokens": 953,
          "batch_size": 2,
          "total_loss": 0.7966516740925039,
          "samples_seen": 171,
          "gradnorm": null,
          "total_samples": 311150,
          "timestamp": "2025-05-09T07:23:20.699976"
      }
      Epoch: 0, Step: 69, Rank: 0, loss = nan
      [rank0]: Traceback (most recent call last):
      [rank0]:   File "/home/ec2-user/jkunstle/training/src/instructlab/training/main_ds.py", line 1047, in <module>
      [rank0]:     main(args)
      [rank0]:   File "/home/ec2-user/jkunstle/training/src/instructlab/training/main_ds.py", line 692, in main
      [rank0]:     train(
      [rank0]:   File "/home/ec2-user/jkunstle/training/src/instructlab/training/main_ds.py", line 484, in train
      [rank0]:     "total_loss": float(log_loss / num_loss_counted_tokens),
      [rank0]:                         ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
      [rank0]: ZeroDivisionError: float division by zero
      wandb:
      wandb: View run happy-eon-77 at: https://wandb.ai/jkunstle-test/h200-config-building/runs/un7km6qo
      wandb: Find logs at: wandb/run-20250509_072232-un7km6qo/logs
      [rank0]:[W509 07:23:23.390667148 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
      E0509 07:23:25.395000 152065 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 152132) of binary: /home/ec2-user/jkunstle/training/venv/bin/python3.11
      Traceback (most recent call last):
        File "/home/ec2-user/jkunstle/training/venv/bin/torchrun", line 8, in <module>
          sys.exit(main())
                   ^^^^^^
        File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
          return f(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^
        File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/run.py", line 892, in main
          run(args)
        File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/run.py", line 883, in run
          elastic_launch(
        File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
          return launch_agent(self._config, self._entrypoint, list(args))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
          raise ChildFailedError(
      torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
      ============================================================
      /home/ec2-user/jkunstle/training/src/instructlab/training/main_ds.py FAILED
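
      A minimal sketch of one way to avoid the crash, using the `log_loss` and `num_loss_counted_tokens` names from the traceback above. The helper name, the guard, and the NaN fallback are assumptions for illustration, not the upstream fix:

      # Hypothetical helper: when every label in a batch is masked out of the
      # loss, num_loss_counted_tokens is 0 and the division must be guarded
      # instead of performed.
      def safe_total_loss(log_loss: float, num_loss_counted_tokens: int) -> float:
          if num_loss_counted_tokens > 0:
              return float(log_loss / num_loss_counted_tokens)
          # Fall back to NaN (or skip logging the metric for this step).
          return float("nan")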

      Upstream URL: https://github.com/instructlab/training/issues/547

              Assignee: Unassigned
              Labels: upstream-sync