User Story:
As a PSAP engineer,
I want to test and improve the performance of multi-node MLPerf v4.0 Training models with direct-attached NVMe drives, the NVIDIA Network Operator, GPUDirect RDMA, and the latest versions of OCP,
So that we can write instructions for Supermicro to use in their lab and debug and fix any issues, errors, or performance problems we encounter.
Supermicro will use InfiniBand in their lab and we will use RoCE in the Alias lab, since we do not have the InfiniBand hardware to mirror exactly what they have in the Supermicro lab.
I would also like to test Chakra (an MLCommons project for profiling network usage while running ML workloads).
Acceptance criteria: Multi-node training of MLPerf v4.0 on OCP with InfiniBand and GPUDirect RDMA scales close to the per-model scaling achieved by NVIDIA with their own software stack.