- Type: Epic
- Resolution: Done
- Priority: Normal
- Summary: Evaluate the run.ai software stack
- Sprints: PSAP Sprint 217, PSAP Sprint 218, PSAP Sprint 219, PSAP Sprint 220, PSAP Sprint 221
- Progress: 0% To Do, 0% In Progress, 100% Done
Epic Goal
- Characterize the performance of the GPU sharing capabilities of the run.ai software stack on OpenShift, and evaluate its advanced scheduling capabilities for distributed compute workloads
- https://www.run.ai/platform/kubernetes-scheduler/
- https://www.run.ai/blog/runai-creates-first-fractional-gpu-sharing-for-kubernetes-deep-learning-workloads/
Why is this important?
- run.ai is a certified operator on OpenShift. We would like to run standard AI/ML benchmarks (MLPerf, NVIDIA deep learning benchmarks, Phoronix, etc.) and evaluate the overhead the GPU sharing stack introduces. In addition, run.ai provides advanced scheduling capabilities such as queuing and gang scheduling for distributed compute jobs. It is important to evaluate those capabilities to showcase how customers can run batch jobs at scale using the run.ai software stack on OpenShift
Scenarios
- multiple AI/ML workload pods accessing the GPU sequentially (without the run.ai stack) using the Kubernetes extended resource mechanism
- multiple AI/ML workload pods accessing the GPU in parallel without using the Kubernetes extended resource mechanism
- multiple AI/ML workload pods accessing the GPU in parallel using the Kubernetes extended resource mechanism and the run.ai sharing mechanism
- multiple distributed ML and HPC jobs at scale using GPUs, leveraging the run.ai scheduler
- batch jobs launched from RHODS or opendatahub using the run.ai scheduler
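To make the scenarios above concrete, here is a minimal sketch of the two pod-level request styles being compared: a standard Kubernetes extended-resource request for a whole GPU, and a fractional request via the run.ai stack. The pod names and container image are illustrative placeholders, and the `gpu-fraction` annotation and `runai-scheduler` scheduler name follow run.ai's published convention but should be verified against the installed version.

```yaml
# Whole-GPU request using the Kubernetes extended resource mechanism
# (scenario 1: sequential access without the run.ai stack).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-benchmark-whole        # illustrative name
spec:
  containers:
  - name: bench
    image: nvcr.io/nvidia/pytorch:22.04-py3   # placeholder benchmark image
    resources:
      limits:
        nvidia.com/gpu: 1          # claims one full GPU
---
# Fractional-GPU request via the run.ai sharing mechanism
# (scenario 3: parallel access; annotation/scheduler names assumed).
apiVersion: v1
kind: Pod
metadata:
  name: gpu-benchmark-fractional   # illustrative name
  annotations:
    gpu-fraction: "0.5"            # request half of a GPU's memory
spec:
  schedulerName: runai-scheduler   # hand the pod to the run.ai scheduler
  containers:
  - name: bench
    image: nvcr.io/nvidia/pytorch:22.04-py3   # placeholder benchmark image
```

Comparing benchmark results between pods of the first kind run back-to-back and pods of the second kind run concurrently is what surfaces the sharing overhead the epic sets out to measure.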
Acceptance Criteria
- An internal-facing performance and capabilities assessment report
Dependencies (internal and external)
- ...
Previous Work (Optional):
- …
Open questions:
- …
1. Docs Tracker | Closed | Ashish Kamra
2. TE Tracker | Closed | Ashish Kamra
3. QE Tracker | Closed | Ashish Kamra