  1. Performance and Scale for AI Platforms
  2. PSAP-1526

FSDP training across multiple nodes

    • Type: Story
    • Resolution: Done
    • Priority: Major
    • Feb 11
    • None
    • None
    • None
    • RHOAI, Training
    • False
    • False
    • None
    • 1
    • PSAP - General-10, PSAP - General-11, PSAP - General-12, PSAP - General-13

      User Story
      The goal is to get multi-node FSDP training running over both Ethernet and RDMA.
      The models I am targeting are:

      Meta-Llama-3-8B-Instruct
      Meta-Llama-3-70B-Instruct

      The hardware for this task is 3 x A30 GPUs, one per host, for a total of 3 hosts.
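
      As a rough illustration of the setup described above, the sketch below builds a per-node torchrun command line and a set of NCCL environment variables for each transport. It is not the configuration used in this task. The script name train_fsdp.py, the interface name eth0, the HCA name mlx5_0, the address 10.0.0.1 and port 29500 are all placeholder assumptions. The sketch only prints the settings and launches nothing.

```python
from typing import Dict, List


def nccl_env(transport: str,
             socket_ifname: str = "eth0",    # assumed interface name
             ib_hca: str = "mlx5_0") -> Dict[str, str]:  # assumed HCA name
    """Return NCCL environment variables for the chosen transport."""
    if transport == "ethernet":
        # Disable the InfiniBand/RoCE transport so NCCL uses TCP sockets.
        return {
            "NCCL_IB_DISABLE": "1",
            "NCCL_SOCKET_IFNAME": socket_ifname,
        }
    if transport == "rdma":
        # Keep the IB/RoCE transport enabled and pin it to one adapter.
        return {
            "NCCL_IB_DISABLE": "0",
            "NCCL_IB_HCA": ib_hca,
            "NCCL_SOCKET_IFNAME": socket_ifname,
        }
    raise ValueError(f"unknown transport: {transport!r}")


def torchrun_command(node_rank: int,
                     master_addr: str,
                     script: str = "train_fsdp.py",  # hypothetical script
                     nnodes: int = 3,
                     nproc_per_node: int = 1,
                     master_port: int = 29500) -> List[str]:
    """Build the torchrun argument list for one host (one GPU per host)."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--node_rank={node_rank}",
        f"--master_addr={master_addr}",
        f"--master_port={master_port}",
        script,
    ]


if __name__ == "__main__":
    for transport in ("ethernet", "rdma"):
        env = nccl_env(transport)
        prefix = " ".join(f"{k}={v}" for k, v in env.items())
        for rank in range(3):
            cmd = " ".join(torchrun_command(rank, master_addr="10.0.0.1"))
            print(f"[{transport}] node {rank}: {prefix} {cmd}")
```

      Running the same command on each of the 3 hosts with a different node rank gives a world size of 3. Switching between the two environment sets is one way to compare Ethernet and RDMA on identical hardware.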

              bbenshab Boaz Ben Shabat
              bbenshab Boaz Ben Shabat

                Created:
                Updated:
                Resolved: