- Type: Epic
- Resolution: Done
- Priority: Normal
- Summary: Evaluate the run.ai software stack
- Sprints: PSAP Sprint 217, PSAP Sprint 218, PSAP Sprint 219, PSAP Sprint 220, PSAP Sprint 221
- Progress: 0% To Do, 0% In Progress, 100% Done
Epic Goal
- Characterize the performance of the GPU sharing capabilities of the run.ai software stack on OpenShift, and evaluate its advanced scheduling capabilities for distributed compute workloads
- https://www.run.ai/platform/kubernetes-scheduler/
- https://www.run.ai/blog/runai-creates-first-fractional-gpu-sharing-for-kubernetes-deep-learning-workloads/
Why is this important?
- run.ai is a certified operator on OpenShift. We would like to run standard AI/ML benchmarks (MLPerf, NVIDIA deep learning benchmarks, Phoronix, etc.) and evaluate the overhead the GPU sharing stack introduces. In addition, run.ai provides advanced scheduling capabilities such as queuing and gang scheduling for distributed compute jobs. It is important to evaluate those capabilities to showcase how customers can run batch jobs at scale using the run.ai software stack on OpenShift
Scenarios
- multiple AI/ML workload pods accessing the GPU sequentially (without the run.ai stack) using the Kubernetes extended resource mechanism
- multiple AI/ML workload pods accessing the GPU in parallel without using the Kubernetes extended resource mechanism
- multiple AI/ML workload pods accessing the GPU in parallel using the Kubernetes extended resource mechanism and the run.ai sharing mechanism
- multiple distributed ML and HPC jobs at scale using GPUs, leveraging the run.ai scheduler
- batch jobs launched from RHODS or opendatahub using the run.ai scheduler
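To make the scenarios above concrete, here is a minimal sketch of the two pod-level request styles being compared: a standard Kubernetes extended-resource request for a whole GPU, and a fractional request via the run.ai stack. The pod names and container image are illustrative placeholders, and the `gpu-fraction` annotation and `runai-scheduler` scheduler name follow run.ai's published convention but should be verified against the installed version.

```yaml
# Whole-GPU request using the Kubernetes extended resource mechanism
# (scenario 1: sequential access without the run.ai stack).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-benchmark-whole        # illustrative name
spec:
  containers:
  - name: bench
    image: nvcr.io/nvidia/pytorch:22.04-py3   # placeholder benchmark image
    resources:
      limits:
        nvidia.com/gpu: 1          # claims one full GPU
---
# Fractional-GPU request via the run.ai sharing mechanism
# (scenario 3: parallel access; annotation/scheduler names assumed).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-benchmark-fractional   # illustrative name
  annotations:
    gpu-fraction: "0.5"            # request half of a GPU's memory
spec:
  schedulerName: runai-scheduler   # hand the pod to the run.ai scheduler
  containers:
  - name: bench
    image: nvcr.io/nvidia/pytorch:22.04-py3   # placeholder benchmark image
```

Comparing benchmark results between pods of the first kind run back-to-back and pods of the second kind run concurrently is what surfaces the sharing overhead the epic sets out to measure.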
Acceptance Criteria
- An internal-facing performance and capabilities assessment report
Dependencies (internal and external)
- ...
Previous Work (Optional):
- …
Open questions:
- …
1. Docs Tracker | Closed | Ashish Kamra
2. TE Tracker | Closed | Ashish Kamra
3. QE Tracker | Closed | Ashish Kamra