-
Story
-
Resolution: Obsolete
User Story:
As a PSAP engineer, I will create the AppWrapper YAML needed to train multi-node ResNet50 on 3 worker nodes, each with A30 GPUs and 1 NVMe drive that holds the training data. This requires writing the AppWrapper spec by hand; in the future the CodeFlare team will generate this YAML automatically via TorchX or codeflare-sdk.
I want to demonstrate this multi-node ResNet50 training with CodeFlare as a step in preparation for running MLPerf multi-node training.
The benefit of this is that I will create a blog post that our customers can follow when they want to do similar multi-node neural network model training on OpenShift using the CodeFlare stack.
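For orientation, a rough sketch of what the hand-written spec might contain, built as a Python dict and printed as JSON (which `oc apply -f -` also accepts). Everything specific here is an assumption and not taken from this ticket: the `mcad.ibm.com/v1beta1` AppWrapper layout with `GenericItems`/`generictemplate` should be checked against the CRD installed on the cluster, and the image name, namespace, NVMe mount path, GPU count per node, and launch command are hypothetical placeholders.

```python
import json


def build_appwrapper(name, namespace, image, workers=3, gpus_per_worker=1,
                     data_path="/mnt/nvme/imagenet"):
    """Return an AppWrapper manifest (as a dict) wrapping one Indexed Job.

    Assumed, not confirmed by the ticket: the AppWrapper field names,
    the image, the mount path, and the number of GPUs per worker.
    """
    container = {
        "name": "trainer",
        "image": image,  # hypothetical training image
        # Placeholder launch command; a real multi-node run also needs
        # rendezvous settings (master address, port, node rank).
        "command": ["python", "train.py", "--data", data_path],
        # nvidia.com/gpu is the resource name exposed by the NVIDIA device plugin.
        "resources": {"limits": {"nvidia.com/gpu": gpus_per_worker}},
        "volumeMounts": [{"name": "nvme-data", "mountPath": data_path}],
    }
    pod_spec = {
        "restartPolicy": "Never",
        "containers": [container],
        # Assumes the NVMe drive is already mounted at data_path on each node.
        "volumes": [{"name": "nvme-data", "hostPath": {"path": data_path}}],
    }
    job = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # One pod per worker; Indexed mode gives each pod a stable
            # JOB_COMPLETION_INDEX that can serve as its node rank.
            "completions": workers,
            "parallelism": workers,
            "completionMode": "Indexed",
            "template": {"spec": pod_spec},
        },
    }
    return {
        "apiVersion": "mcad.ibm.com/v1beta1",  # assumed MCAD API group/version
        "kind": "AppWrapper",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "resources": {
                "GenericItems": [{"replicas": 1, "generictemplate": job}],
            },
        },
    }


if __name__ == "__main__":
    manifest = build_appwrapper("resnet50-multinode", "demo",
                                "example.com/resnet50:latest")
    print(json.dumps(manifest, indent=2))
```

The sketch does not force one pod per worker node; a pod anti-affinity rule or topology spread constraint would be needed for that, and is left out to keep the example short.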
Acceptance criteria: