Performance and Scale for AI Platforms
PSAP-628

Evaluate various aspects of the run.ai software stack


    • Evaluate the run.ai software stack
    • PSAP Sprint 217, PSAP Sprint 218, PSAP Sprint 219, PSAP Sprint 220, PSAP Sprint 221
    • 0% To Do, 0% In Progress, 100% Done

      Epic Goal

      Why is this important?

      • run.ai is a certified operator on OpenShift. We would like to run standard AI/ML benchmarks (MLPerf, NVIDIA deep learning benchmarks, Phoronix, etc.) and evaluate the overhead that the GPU sharing stack introduces. run.ai also provides advanced scheduling capabilities, such as queuing and gang scheduling for distributed compute jobs. It is important to evaluate those capabilities to showcase how customers can run batch jobs at scale using the run.ai software stack on OpenShift.
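      To make the scheduling evaluation concrete, a workload is typically handed to the run.ai scheduler by setting the pod's `schedulerName`. A minimal sketch follows, assuming the scheduler is deployed under the common name `runai-scheduler` and that queuing/quota is driven by a project label; the exact scheduler name and label key depend on the run.ai version installed, so check the run.ai documentation. The project name `team-a` and the training image/command are hypothetical.

      ```yaml
      # Sketch: a batch pod submitted to the run.ai scheduler instead of the
      # default Kubernetes scheduler. Scheduler name and "project" label key
      # are assumptions; verify against the installed run.ai version.
      apiVersion: v1
      kind: Pod
      metadata:
        name: batch-train-job
        labels:
          project: team-a            # run.ai queues/quotas are per project (assumed key)
      spec:
        schedulerName: runai-scheduler   # hand scheduling decisions to run.ai
        restartPolicy: Never
        containers:
          - name: trainer
            image: nvcr.io/nvidia/pytorch:23.05-py3   # hypothetical training image
            command: ["python", "train.py"]
            resources:
              limits:
                nvidia.com/gpu: 1    # whole-GPU request via the extended resource
      ```

      With many such pods submitted at once, the run.ai scheduler's queuing behavior (rather than first-come-first-served pod admission) is what the evaluation would observe.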

      Scenarios

      1. Multiple AI/ML workload pods accessing the GPU sequentially (without the run.ai stack), using the Kubernetes extended resource mechanism
      2. Multiple AI/ML workload pods accessing the GPU in parallel, without using the Kubernetes extended resource mechanism
      3. Multiple AI/ML workload pods accessing the GPU in parallel, using the Kubernetes extended resource mechanism together with the run.ai sharing mechanism
      4. Multiple distributed ML and HPC jobs at scale using GPUs, leveraging the run.ai scheduler
      5. Launching batch jobs from RHODS or Open Data Hub using the run.ai scheduler
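      The key difference between scenarios 1 and 3 is how the pod asks for GPU capacity. A minimal sketch of the two request styles, assuming run.ai's fractional-GPU support is configured via a `gpu-fraction` annotation (the exact annotation key and mechanism vary by run.ai version, so treat this as illustrative only):

      ```yaml
      # Scenario 1: exclusive GPU access via the Kubernetes extended resource.
      # The device plugin guarantees the GPU is not shared, so pods run
      # sequentially when GPUs are scarce.
      apiVersion: v1
      kind: Pod
      metadata:
        name: exclusive-gpu-workload
      spec:
        containers:
          - name: bench
            image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # hypothetical image
            resources:
              limits:
                nvidia.com/gpu: 1    # whole GPU, enforced by the scheduler
      ---
      # Scenario 3: fractional GPU via the run.ai sharing mechanism.
      # No nvidia.com/gpu request; the fraction is expressed to the run.ai
      # scheduler, letting several pods share one physical GPU in parallel.
      apiVersion: v1
      kind: Pod
      metadata:
        name: shared-gpu-workload
        annotations:
          gpu-fraction: "0.5"        # assumed run.ai annotation key
      spec:
        schedulerName: runai-scheduler   # assumed scheduler name
        containers:
          - name: bench
            image: nvcr.io/nvidia/cuda:12.2.0-base-ubuntu22.04   # hypothetical image
      ```

      Scenario 2 (parallel access without the extended resource) corresponds to simply omitting both the `nvidia.com/gpu` limit and the run.ai annotations, leaving GPU contention unmanaged — the baseline the overhead measurements would compare against.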

      Acceptance Criteria

      • An internal-facing performance and capabilities assessment report

      Dependencies (internal and external)

      1. ...

      Previous Work (Optional):

      Open questions:


              yfama Yuchen Fama
              akamra8979 Ashish Kamra
              Votes: 0
              Watchers: 1
