Multi-node MLPerf v3.0 LLM training on OpenShift
Type: Epic
Resolution: Obsolete
Labels: RHOAI, Training
Progress: 0% To Do, 0% In Progress, 100% Done
OCP/Telco Definition of Done
Epic Goal
- Determine best practices for multi-node training of an LLM from MLPerf v3.0 on OpenShift with Nvidia GPUs and NVAI-E.
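One possible shape for such a multi-node run on OpenShift is a Kubeflow Training Operator PyTorchJob. The manifest below is an illustrative sketch only: the image tag, training script, node count, and GPU counts are assumptions, not the actual submission configuration.

```yaml
# Sketch of a 2-node x 8-GPU distributed training job; all names and
# sizes here are illustrative assumptions.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: bert-multinode-train
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:23.04-py3  # hypothetical NGC tag
              command: ["torchrun", "--nnodes=2", "--nproc_per_node=8", "train_bert.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: nvcr.io/nvidia/pytorch:23.04-py3  # hypothetical NGC tag
              command: ["torchrun", "--nnodes=2", "--nproc_per_node=8", "train_bert.py"]
              resources:
                limits:
                  nvidia.com/gpu: 8
```

The Training Operator injects the rendezvous environment (master address, world size) into each replica, so the same `torchrun` invocation can be used on master and workers.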
Why is this important?
- Red Hat wants to submit MLPerf v3.0 distributed training formal results with Supermicro on OpenShift using NVAI-E.
- We want to provide our customers with best practices for multi-node training of LLMs on OpenShift.
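One widely cited best practice when moving from single-GPU to multi-node training is the linear learning-rate scaling rule: as the global batch size grows with the number of GPUs, scale the learning rate by the same factor. A minimal sketch, with illustrative baseline values that are not the MLPerf submission settings:

```python
def scaled_hyperparams(base_lr, per_gpu_batch, gpus_per_node, num_nodes):
    """Compute the global batch size and a linearly scaled learning rate.

    Linear scaling rule: when the global batch size grows by a factor k
    relative to the single-GPU baseline, scale the learning rate by k.
    Baseline values are illustrative assumptions.
    """
    world_size = gpus_per_node * num_nodes
    global_batch = per_gpu_batch * world_size
    scaled_lr = base_lr * world_size
    return global_batch, scaled_lr

# Example: 2 nodes x 8 GPUs, per-GPU batch of 32, single-GPU base LR of 1e-4
gb, lr = scaled_hyperparams(1e-4, 32, 8, 2)
print(gb, lr)  # 512 and 0.0016
```

In practice the scaled rate is usually reached via a warmup schedule rather than applied from step zero.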
Scenarios
- ...
Acceptance Criteria
- CI - MUST be running successfully with tests automated
- Release Technical Enablement - Provide necessary release enablement details and documents.
- ...
Dependencies (internal and external)
- On 2/27/2023 Diane Feddema and Erwan Gallen met with Nvidia PMs and engineers Joanne Clark, Charlie Huang, and Priya Tikoo to discuss NVAI-E on OpenShift and multi-node training. Joanne, Charlie, and Priya will give us the BERT preprocessed dataset (pre-processed for Nvidia PyTorch, MLPerf v2.1) and their multi-node distributed training code for BERT (MLPerf v2.1); we need this preprocessed data in order to train BERT with NVAI-E PyTorch. Erwan explained that an OpenShift customer has requested guidance on running multi-node model training on OpenShift with Nvidia GPUs.
- On 3/02/2023 Diane Feddema attended the MLPerf Training WG meeting and asked John Tran if he could assist in getting the Nvidia v2.1 BERT submission data preprocessing code working. The preprocessing script hits a segfault and does not finish successfully. Nvidia has a company holiday for the rest of this week, so John said to expect a reply next week (3/06/2023).
Previous Work (Optional):
- …
Open questions:
- Use MLflow to log the output of training runs?
- Train BERT with GPUDirect RDMA using the in-tree or out-of-tree driver? See PSAP-1016.
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>