Red Hat Enterprise Linux AI / RHELAI-4124

[instructlab/training] `num_loss_counted_tokens` == 0 for a batch causes a division-by-zero error.

      Upstream Reporter: James Kunstle
      Upstream issue status: Closed
      Upstream description:

      The issue happened while running on 1xH200; FSDP defaulted to NO_SHARD since world_size=1. At step 69 the batch contained zero loss-counted tokens, so the total-loss metric computation divided by zero (see the traceback below and the guard sketch after it).

      Epoch 0:   0%|          | 68/133000 [00:32<15:20:43,  2.41it/s]{
          "epoch": 0,
          "step": 68,
          "rank": 0,
          "overall_throughput": 5.975545350090432,
          "lr": 2e-05,
          "cuda_mem_allocated": 61.14389753341675,
          "max_cuda_mem_allocated": 63.741647720336914,
          "cuda_malloc_retries": 0,
          "num_loss_counted_tokens": 953,
          "batch_size": 2,
          "total_loss": 0.7966516740925039,
          "samples_seen": 171,
          "gradnorm": null,
          "total_samples": 311150,
          "timestamp": "2025-05-09T07:23:20.699976"
      }
      Epoch: 0, Step: 69, Rank: 0, loss = nan
      [rank0]: Traceback (most recent call last):
      [rank0]:   File "/home/ec2-user/jkunstle/training/src/instructlab/training/main_ds.py", line 1047, in <module>
      [rank0]:     main(args)
      [rank0]:   File "/home/ec2-user/jkunstle/training/src/instructlab/training/main_ds.py", line 692, in main
      [rank0]:     train(
      [rank0]:   File "/home/ec2-user/jkunstle/training/src/instructlab/training/main_ds.py", line 484, in train
      [rank0]:     "total_loss": float(log_loss / num_loss_counted_tokens),
      [rank0]:                         ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~
      [rank0]: ZeroDivisionError: float division by zero
      wandb:
      wandb: View run happy-eon-77 at: https://wandb.ai/jkunstle-test/h200-config-building/runs/un7km6qo
      wandb: Find logs at: wandb/run-20250509_072232-un7km6qo/logs
      [rank0]:[W509 07:23:23.390667148 ProcessGroupNCCL.cpp:1476] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
      E0509 07:23:25.395000 152065 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 152132) of binary: /home/ec2-user/jkunstle/training/venv/bin/python3.11
      Traceback (most recent call last):
        File "/home/ec2-user/jkunstle/training/venv/bin/torchrun", line 8, in <module>
          sys.exit(main())
                   ^^^^^^
        File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
          return f(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^
        File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/run.py", line 892, in main
          run(args)
        File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/run.py", line 883, in run
          elastic_launch(
        File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 139, in __call__
          return launch_agent(self._config, self._entrypoint, list(args))
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/ec2-user/jkunstle/training/venv/lib64/python3.11/site-packages/torch/distributed/launcher/api.py", line 270, in launch_agent
          raise ChildFailedError(
      torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
      ============================================================
      /home/ec2-user/jkunstle/training/src/instructlab/training/main_ds.py FAILED
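
      A minimal sketch of one way to avoid the crash, using the `log_loss` and `num_loss_counted_tokens` names from the traceback above. The helper name, the guard, and the NaN fallback are assumptions for illustration, not the upstream fix:

      # Hypothetical helper: when every label in a batch is masked out of the
      # loss, num_loss_counted_tokens is 0 and the division must be guarded
      # instead of performed.
      def safe_total_loss(log_loss: float, num_loss_counted_tokens: int) -> float:
          if num_loss_counted_tokens > 0:
              return float(log_loss / num_loss_counted_tokens)
          # Fall back to NaN (or skip logging the metric for this step).
          return float("nan")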

      Upstream URL: https://github.com/instructlab/training/issues/547

              Assignee: Unassigned
              Labels: upstream-sync