  Project: Performance and Scale for AI Platforms
  PSAP-1027

Multi-node training of MLPerf on OpenShift with Supermicro and Nvidia


    • Type: Epic
    • Resolution: Obsolete
    • Priority: Undefined
    • Epic Name: Multi-node MLPerf v3.0 LLM training on OpenShift
    • RHOAI, Training
    • Progress: 0% To Do, 0% In Progress, 100% Done

      OCP/Telco Definition of Done
      Epic Template descriptions and documentation.


      Epic Goal

      • Determine best practices for multi-node training of an LLM from MLPerf v3.0 on OpenShift with Nvidia GPUs and NVAI-E (see the distributed-training sketch below).
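
      As a starting point for the best-practices work, here is a minimal sketch of what each worker pod would run for multi-node data-parallel training with PyTorch DDP over NCCL. It assumes the rank/world-size environment variables are injected by the launcher (e.g. torchrun or a PyTorchJob from the Kubeflow Training Operator); the tiny model and random data are placeholders for the MLPerf BERT/LLM reference workload, not part of this epic.

        import os
        import torch
        import torch.distributed as dist
        from torch.nn.parallel import DistributedDataParallel as DDP

        def main():
            # MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK are
            # assumed to be set by the launcher (torchrun / PyTorchJob).
            local_rank = int(os.environ["LOCAL_RANK"])
            torch.cuda.set_device(local_rank)

            # NCCL backend: both the NVAI-E containers and GPUDirect RDMA
            # ultimately exercise this code path.
            dist.init_process_group(backend="nccl")

            # Placeholder model; the real workload is the MLPerf reference
            # implementation (e.g. BERT pre-training).
            model = torch.nn.Linear(1024, 1024).cuda(local_rank)
            model = DDP(model, device_ids=[local_rank])

            optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
            for step in range(10):
                x = torch.randn(32, 1024, device=local_rank)
                loss = model(x).square().mean()
                optimizer.zero_grad()
                loss.backward()          # gradients all-reduced across nodes here
                optimizer.step()
                if dist.get_rank() == 0:
                    print(f"step {step}: loss {loss.item():.4f}")

            dist.destroy_process_group()

        if __name__ == "__main__":
            main()

      This is the smallest thing worth validating across two Supermicro nodes before scaling up, since any cross-node NCCL or fabric problem will already show up in the all-reduce during backward().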

      Why is this important?

      • Red Hat wants to submit formal MLPerf v3.0 distributed training results with Supermicro on OpenShift using NVAI-E.
      • We want to provide our customers with best practices for multi-node training of LLMs on OpenShift.

      Scenarios

      1. ...

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • ...

      Dependencies (internal and external)

      1. On 2/27/2023 Diane Feddema and Erwan Gallen met with Nvidia PMs and engineers Joanne Clark, Charlie Huang, and Priya Tikoo to discuss NVAI-E on OpenShift and multi-node training. Joanne/Charlie/Priya will give us the BERT dataset pre-processed for Nvidia PyTorch (MLPerf v2.1) and their multi-node distributed training code for BERT (MLPerf v2.1). We need this pre-processed data in order to train BERT with NVAI-E PyTorch. Erwan explained that an OpenShift customer has requested guidance on running multi-node training of models on OpenShift with Nvidia GPUs.
      2. On 3/02/2023 Diane Feddema attended the MLPerf Training WG meeting and asked John Tran whether he could assist in getting the Nvidia v2.1 BERT submission data-preprocessing code working; the preprocessing script currently hits a segmentation fault and does not finish. Nvidia has a company holiday for the rest of this week, so John said to expect a reply next week (3/06/2023).

      Previous Work (Optional):

      Open questions:

      1. Run in MLflow to log the output of training runs? (See the logging sketch after this list.)
      2. Train BERT with GPUDirect RDMA using the in-tree or out-of-tree driver?
        See PSAP-1016 and the NCCL environment sketch after this list.
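
      For open question 1, a minimal sketch of what logging a training run to MLflow could look like, assuming an MLflow tracking service is reachable from the job. The tracking URI, experiment name, and metric values below are placeholders, not decisions.

        import mlflow

        # Hypothetical in-cluster tracking service; replace with the real endpoint.
        mlflow.set_tracking_uri("http://mlflow.example.svc:5000")
        mlflow.set_experiment("mlperf-v3.0-llm-multinode")

        with mlflow.start_run(run_name="bert-2node-smoke-test"):
            mlflow.log_params({"nodes": 2, "gpus_per_node": 8, "global_batch_size": 256})
            for step, loss in enumerate([2.31, 1.87, 1.52]):   # stand-in for real training output
                mlflow.log_metric("train_loss", loss, step=step)
            mlflow.log_artifact("train.log")                     # e.g. the raw MLPerf result log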
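
      For open question 2, whichever driver is chosen, GPUDirect RDMA surfaces to the training job as NCCL configuration. A hedged sketch of the environment variables commonly set when validating it; the HCA names are placeholders for whatever the Supermicro nodes actually expose.

        import os

        # NCCL knobs commonly used when validating GPUDirect RDMA over InfiniBand;
        # exact values depend on the node's NICs and the in-tree vs out-of-tree driver choice.
        os.environ.setdefault("NCCL_IB_DISABLE", "0")            # keep the InfiniBand transport enabled
        os.environ.setdefault("NCCL_IB_HCA", "mlx5_0,mlx5_1")    # placeholder HCA list
        os.environ.setdefault("NCCL_NET_GDR_LEVEL", "SYS")       # allow GPUDirect RDMA system-wide
        os.environ.setdefault("NCCL_DEBUG", "INFO")              # NCCL logs confirm whether GDRDMA is actually used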

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

              rhn-support-dfeddema Diane Feddema
              Eran Ifrach
