Multi-node MLPerf v3.0 LLM training on OpenShift
Type: Epic
Resolution: Obsolete
Labels: RHOAI, Training
Progress: 0% To Do, 0% In Progress, 100% Done
OCP/Telco Definition of Done
Epic Goal
- Determine best practices for multi-node training of an LLM from MLPerf v3.0 on OpenShift with Nvidia GPUs and NVAI-E.
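One possible shape for such a multi-node run on OpenShift is a Kubeflow Training Operator PyTorchJob. The manifest below is an illustrative sketch only: the image tag, training script, node count, and GPU counts are assumptions, not the actual submission configuration.

```yaml
# Sketch of a 2-node x 8-GPU distributed training job; all names and
# sizes here are illustrative assumptions.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-multinode-train
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:23.04-py3  # hypothetical NGC tag
              command: ["torchrun", "--nnodes=2", "--nproc_per_node=8", "train_bert.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:23.04-py3  # hypothetical NGC tag
              command: ["torchrun", "--nnodes=2", "--nproc_per_node=8", "train_bert.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
```

The Training Operator injects the rendezvous environment (master address, world size) into each replica, so the same `torchrun` invocation can be used on master and workers.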
Why is this important?
- Red Hat wants to submit MLPerf v3.0 distributed training formal results with Supermicro on OpenShift using NVAI-E.
- We want to provide our customers with best practices for multi-node training of LLMs on OpenShift.
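One widely cited best practice when moving from single-GPU to multi-node training is the linear learning-rate scaling rule: as the global batch size grows with the number of GPUs, scale the learning rate by the same factor. A minimal sketch, with illustrative baseline values that are not the MLPerf submission settings:

```python
def scaled_hyperparams(base_lr, per_gpu_batch, gpus_per_node, num_nodes):
    """Compute the global batch size and a linearly scaled learning rate.

    Linear scaling rule: when the global batch size grows by a factor k
    relative to the single-GPU baseline, scale the learning rate by k.
    Baseline values are illustrative assumptions.
    """
    world_size = gpus_per_node * num_nodes
    global_batch = per_gpu_batch * world_size
    scaled_lr = base_lr * world_size
    return global_batch, scaled_lr

# Example: 2 nodes x 8 GPUs, per-GPU batch of 32, single-GPU base LR of 1e-4
gb, lr = scaled_hyperparams(1e-4, 32, 8, 2)
print(gb, lr)  # 512 and 0.0016
```

In practice the scaled rate is usually reached via a warmup schedule rather than applied from step zero.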
Scenarios
- ...
Acceptance Criteria
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
- ...
Dependencies (internal and external)
- On 2/27/2023 Diane Feddema and Erwan Gallen met with Nvidia PMs and engineers Joanne Clark, Charlie Huang, and Priya Tikoo to discuss NVAI-E on OpenShift and multi-node training. Joanne, Charlie, and Priya will give us the BERT preprocessed dataset (pre-processed for Nvidia PyTorch, MLPerf v2.1) and their multi-node distributed training code for BERT (MLPerf v2.1); we need this preprocessed data in order to train BERT with NVAI-E PyTorch. Erwan explained that an OpenShift customer has requested guidance on running multi-node model training on OpenShift with Nvidia GPUs.
- On 3/02/2023 Diane Feddema attended the MLPerf Training WG meeting and asked John Tran if he could assist in getting the Nvidia v2.1 BERT submission data preprocessing code working. The preprocessing script hits a segfault and does not finish successfully. Nvidia has a company holiday for the rest of this week, so John said to expect a reply next week (3/06/2023).
Previous Work (Optional):
- …
Open questions:
- Use MLflow to log the output of training runs?
- Train BERT with GPUDirect RDMA using the in-tree or out-of-tree driver? See PSAP-1016.
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>