  Performance and Scale for AI Platforms
  PSAP-1203

Run Multi-node Distributed Training of MLPerf v4.0 Models on Alias Cluster with Nvidia Network Operator (GPUDirect RDMA)


    • Type: Story
    • Resolution: Obsolete
    • Priority: Undefined
    • Feb 11
    • Alongside OpenShift 4.13
    • Component: AI/ML
    • Labels: RHOAI, Training

      User Story:
      As a PSAP engineer,

I want to test and improve the performance of multi-node MLPerf v4.0 Training models with direct-attached NVMe drives, the Nvidia Network Operator, GPUDirect RDMA, and the latest versions of OCP.

      So that we can write instructions for Supermicro to use in their lab, and debug and fix any issues/errors/performance problems we encounter.
      Supermicro will use InfiniBand in their lab and we will use RoCE in the Alias lab; we do not have the InfiniBand hardware to mirror exactly what they have in the Supermicro lab. A quick sanity check of the RoCE + GPUDirect RDMA path is sketched below.
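
      A minimal sanity check of the RoCE + GPUDirect RDMA path, before full MLPerf runs, is a tiny NCCL all-reduce. The sketch below assumes PyTorch launched via torchrun; the HCA names (mlx5_0, mlx5_1), the out-of-band interface (ens2f0), and the RoCEv2 GID index of 3 are illustrative assumptions, not cluster facts.

          import os
          import torch
          import torch.distributed as dist

          # Illustrative values for a RoCE fabric; actual HCA names, GID
          # index, and interface names are cluster specific.
          os.environ.setdefault("NCCL_IB_HCA", "mlx5_0,mlx5_1")  # assumed RDMA NIC names
          os.environ.setdefault("NCCL_IB_GID_INDEX", "3")        # RoCEv2 GID index (commonly 3)
          os.environ.setdefault("NCCL_NET_GDR_LEVEL", "SYS")     # permit GPUDirect RDMA paths
          os.environ.setdefault("NCCL_SOCKET_IFNAME", "ens2f0")  # assumed TCP bootstrap interface
          os.environ.setdefault("NCCL_DEBUG", "INFO")            # logs show whether GDR is used

          if __name__ == "__main__":
              # RANK/WORLD_SIZE/MASTER_ADDR/LOCAL_RANK are injected by torchrun.
              dist.init_process_group(backend="nccl")
              torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
              t = torch.ones(1, device="cuda")
              dist.all_reduce(t)  # tiny collective to exercise the fabric
              if dist.get_rank() == 0:
                  print(f"all_reduce OK across {dist.get_world_size()} ranks: {t.item()}")
              dist.destroy_process_group()

      Run with something like "torchrun --nnodes=2 --nproc_per_node=8 nccl_check.py" (a hypothetical script name); the NCCL_DEBUG=INFO output shows whether the GPUDirect RDMA transport was actually selected.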
      I would also like to test Chakra (an MLCommons project for profiling network usage while running ML workloads); a sketch of capturing the traces it consumes follows.
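
      Chakra consumes PyTorch execution traces. The following is a minimal sketch of capturing one (plus a Kineto trace) for a single step, assuming a recent PyTorch that exposes torch.profiler.ExecutionTraceObserver; train_step is a hypothetical stand-in, and the conversion into Chakra's own trace format is done offline with the mlcommons/chakra tooling.

          import torch
          from torch.profiler import ExecutionTraceObserver, ProfilerActivity, profile

          def train_step(model, batch):
              # Hypothetical stand-in for one MLPerf training step.
              model(batch).sum().backward()

          def capture_traces(model, batch, et_path="pytorch_et.json"):
              # Record an execution trace and a Kineto trace for one step;
              # Chakra's offline converters take these as input.
              et = ExecutionTraceObserver()
              et.register_callback(et_path)
              et.start()
              with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
                  train_step(model, batch)
              et.stop()
              et.unregister_callback()
              prof.export_chrome_trace("kineto_trace.json")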

      Acceptance criteria: Multi-node Training of MLPerf v4.0 on OCP with InfiniBand and GPUDirect RDMA scales close to the scaling Nvidia achieves for each model with their own software stack.
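
      One concrete way to evaluate this criterion: compute per-model scaling efficiency and compare it against the efficiency implied by Nvidia's published multi-node results at the same node count. A sketch with made-up numbers (not measured results):

          def scaling_efficiency(throughput_1node, throughput_nnode, nodes):
              # Fraction of ideal linear scaling achieved at `nodes` nodes.
              return throughput_nnode / (nodes * throughput_1node)

          # Illustrative only: 1000 samples/s on one node vs 3600 samples/s
          # on four nodes gives 0.90, to be compared against Nvidia's
          # efficiency for the same model and scale.
          print(f"{scaling_efficiency(1000.0, 3600.0, 4):.2f}")  # -> 0.90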

              Assignee: rhn-support-dfeddema Diane Feddema (Inactive)
              Reporter: rhn-support-dfeddema Diane Feddema (Inactive)