  1. Performance and Scale for AI Platforms
  2. PSAP-1140

Run ResNet50 multi-node training with Codeflare stack on OCP 4.12 on Alias lab cluster

    • Story
    • Resolution: Obsolete
    • Undefined
    • None
    • None
    • AI/ML
    • None
    • 5

      User Story:
      As a PSAP engineer, I will create the necessary AppWrapper YAML to train multi-node ResNet50 on 3 worker nodes, each with A30 GPUs and 1 NVMe drive that contains the training data. This will require creating the AppWrapper spec manually; in the future, the Codeflare team will generate this AppWrapper YAML automatically via torchx or codeflare-sdk.

      I want to demonstrate this multi-node ResNet50 training with Codeflare as a preparatory step for running MLPerf multi-node training.

      The benefit is that I will create a blog post that our customers can follow when they want to run similar multi-node neural network training on OpenShift using the Codeflare stack.
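
      For illustration only, the manually written spec described in this story might be sketched as below. This is a rough sketch under stated assumptions, not the spec actually used: the `mcad.ibm.com/v1beta1` API group and the `GenericItems` / `generictemplate` field names are assumed from the MCAD AppWrapper CRD of this period and should be checked against the installed CRD. The image name, the data path, the `NODE_RANK` / `NNODES` variables, and the GPU count per worker are hypothetical placeholders; the story does not specify them.

```python
import json


def build_appwrapper(name, workers=3, gpus_per_worker=1,
                     image="example.com/resnet50-train:latest",  # hypothetical image
                     data_path="/mnt/nvme/training-data"):        # hypothetical NVMe mount
    """Return an AppWrapper-like manifest as a plain dict (sketch only).

    One Pod is wrapped per worker. Field names under `spec` are assumptions
    based on the MCAD v1beta1 AppWrapper CRD and may differ in your cluster.
    """
    items = []
    for rank in range(workers):
        pod = {
            "apiVersion": "v1",
            "kind": "Pod",
            "metadata": {"name": f"{name}-worker-{rank}"},
            "spec": {
                "restartPolicy": "Never",
                "containers": [{
                    "name": "trainer",
                    "image": image,
                    # Hypothetical variables a training script could read.
                    "env": [
                        {"name": "NODE_RANK", "value": str(rank)},
                        {"name": "NNODES", "value": str(workers)},
                    ],
                    # Standard Kubernetes extended resource for NVIDIA GPUs.
                    "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
                    "volumeMounts": [{"name": "training-data",
                                      "mountPath": "/data",
                                      "readOnly": True}],
                }],
                # Expose the node-local NVMe data directory to the container.
                "volumes": [{"name": "training-data",
                             "hostPath": {"path": data_path}}],
            },
        }
        items.append({"replicas": 1, "generictemplate": pod})

    return {
        "apiVersion": "mcad.ibm.com/v1beta1",  # assumed API group/version
        "kind": "AppWrapper",
        "metadata": {"name": name},
        "spec": {"resources": {"GenericItems": items}},
    }


if __name__ == "__main__":
    # Print the manifest as JSON for review.
    print(json.dumps(build_appwrapper("resnet50-multinode"), indent=2))
```

      The sketch deliberately omits rendezvous setup for distributed training and any scheduling constraints (for example, anti-affinity to place one Pod per worker node); a real spec would need both.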

      Acceptance criteria:

              rhn-support-dfeddema Diane Feddema
              rhn-support-dfeddema Diane Feddema
              Votes: 0
              Watchers: 3