Red Hat Enterprise Linux AI / RHELAI-4285

[instructlab/training] CPU Offloading w/ FSDP - gradient accumulation is potentially broken


    • Type: Task
    • Resolution: Unresolved
    • Priority: Undefined
    • Component: InstructLab - Training

      [2813270334] Upstream Reporter: James Kunstle
      Upstream issue status: Open
      Upstream description:

      From the FSDP docs: "FSDP currently does not support gradient accumulation outside no_sync() when using CPU offloading. This is because FSDP uses the newly-reduced gradient instead of accumulating with any existing gradient, which can lead to incorrect results."

      https://pytorch.org/docs/stable/fsdp.html


      Upstream URL: https://github.com/instructlab/training/issues/414
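
      To make the constraint concrete, the following is a minimal sketch (not taken from instructlab/training) of the accumulation pattern the FSDP docs require when CPU offloading is enabled: every non-final micro-batch runs its backward pass inside no_sync(), and only the final micro-batch synchronizes gradients and steps the optimizer. The model, dataloader, and accumulation factor below are illustrative placeholders, and torch.distributed is assumed to be initialized (e.g. via torchrun).

      import contextlib

      import torch
      from torch.distributed.fsdp import CPUOffload
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      # Assumes torch.distributed is already initialized (e.g. via torchrun).
      model = FSDP(
          torch.nn.Linear(1024, 1024),  # stand-in for the real model
          cpu_offload=CPUOffload(offload_params=True),
      )
      optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

      grad_accum_steps = 4  # illustrative accumulation factor
      for step, (inputs, targets) in enumerate(dataloader):  # placeholder loader
          sync_now = (step + 1) % grad_accum_steps == 0
          # Per the FSDP docs, accumulating outside no_sync() with CPU
          # offloading replaces the existing gradient with the newly reduced
          # one instead of adding to it, so every non-final micro-batch must
          # run its backward pass inside no_sync().
          ctx = contextlib.nullcontext() if sync_now else model.no_sync()
          with ctx:
              loss = torch.nn.functional.mse_loss(model(inputs), targets)
              (loss / grad_accum_steps).backward()
          if sync_now:
              optimizer.step()
              optimizer.zero_grad()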

    • Assignee: Unassigned
    • Labels: upstream-sync