  Performance and Scale for AI Platforms
  PSAP-1203

Run Multi-node Distributed Training of MLPerf v4.0 Models on Alias Cluster with Nvidia Network Operator (GPUDirect RDMA)


    • Type: Story
    • Resolution: Obsolete
    • Priority: Undefined
    • Feb 11
    • Alongside OpenShift 4.13
    • Component: AI/ML
    • Labels: RHOAI, Training

      User Story:
      As a PSAP engineer,

I want to test and improve the performance of multi-node MLPerf v4.0 Training models with direct-attached NVMe drives, the Nvidia Network Operator, GPUDirect RDMA, and the latest versions of OCP.

      So that we can write instructions for Supermicro to use in their lab, and debug and fix any issues/errors/performance problems we encounter.
      Supermicro will use InfiniBand in their lab and we will use RoCE in the Alias lab; we do not have the InfiniBand hardware to mirror exactly what they have in the Supermicro lab. A quick sanity check of the RoCE + GPUDirect RDMA path is sketched below.
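
      A minimal sanity check of the RoCE + GPUDirect RDMA path, before full MLPerf runs, is a tiny NCCL all-reduce. The sketch below assumes PyTorch launched via torchrun; the HCA names (mlx5_0, mlx5_1), the out-of-band interface (ens2f0), and the RoCEv2 GID index of 3 are illustrative assumptions, not cluster facts.

          import os
          import torch
          import torch.distributed as dist

          # Illustrative values for a RoCE fabric; actual HCA names, GID
          # index, and interface names are cluster specific.
          os.environ.setdefault("NCCL_IB_HCA", "mlx5_0,mlx5_1")  # assumed RDMA NIC names
          os.environ.setdefault("NCCL_IB_GID_INDEX", "3")        # RoCEv2 GID index (commonly 3)
          os.environ.setdefault("NCCL_NET_GDR_LEVEL", "SYS")     # permit GPUDirect RDMA paths
          os.environ.setdefault("NCCL_SOCKET_IFNAME", "ens2f0")  # assumed TCP bootstrap interface
          os.environ.setdefault("NCCL_DEBUG", "INFO")            # logs show whether GDR is used

          if __name__ == "__main__":
              # RANK/WORLD_SIZE/MASTER_ADDR/LOCAL_RANK are injected by torchrun.
              dist.init_process_group(backend="nccl")
              torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
              t = torch.ones(1, device="cuda")
              dist.all_reduce(t)  # tiny collective to exercise the fabric
              if dist.get_rank() == 0:
                  print(f"all_reduce OK across {dist.get_world_size()} ranks: {t.item()}")
              dist.destroy_process_group()

      Run with something like "torchrun --nnodes=2 --nproc_per_node=8 nccl_check.py" (a hypothetical script name); the NCCL_DEBUG=INFO output shows whether the GPUDirect RDMA transport was actually selected.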
      I would also like to test Chakra (an MLCommons project for profiling network usage while running ML workloads); a sketch of capturing the traces it consumes follows.
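
      Chakra consumes PyTorch execution traces. The following is a minimal sketch of capturing one (plus a Kineto trace) for a single step, assuming a recent PyTorch that exposes torch.profiler.ExecutionTraceObserver; train_step is a hypothetical stand-in, and the conversion into Chakra's own trace format is done offline with the mlcommons/chakra tooling.

          import torch
          from torch.profiler import ExecutionTraceObserver, ProfilerActivity, profile

          def train_step(model, batch):
              # Hypothetical stand-in for one MLPerf training step.
              model(batch).sum().backward()

          def capture_traces(model, batch, et_path="pytorch_et.json"):
              # Record an execution trace and a Kineto trace for one step;
              # Chakra's offline converters take these as input.
              et = ExecutionTraceObserver()
              et.register_callback(et_path)
              et.start()
              with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
                  train_step(model, batch)
              et.stop()
              et.unregister_callback()
              prof.export_chrome_trace("kineto_trace.json")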

      Acceptance criteria: Multi-node Training of MLPerf v4.0 on OCP with InfiniBand and GPUDirect RDMA scales close to the scaling Nvidia achieves for each model with their own software stack.
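
      One concrete way to evaluate this criterion: compute per-model scaling efficiency and compare it against the efficiency implied by Nvidia's published multi-node results at the same node count. A sketch with made-up numbers (not measured results):

          def scaling_efficiency(throughput_1node, throughput_nnode, nodes):
              # Fraction of ideal linear scaling achieved at `nodes` nodes.
              return throughput_nnode / (nodes * throughput_1node)

          # Illustrative only: 1000 samples/s on one node vs 3600 samples/s
          # on four nodes gives 0.90, to be compared against Nvidia's
          # efficiency for the same model and scale.
          print(f"{scaling_efficiency(1000.0, 3600.0, 4):.2f}")  # -> 0.90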

              Assignee: rhn-support-dfeddema Diane Feddema (Inactive)
              Reporter: rhn-support-dfeddema Diane Feddema (Inactive)