OpenShift Container Platform (OCP) Strategy / OCPSTRAT-1591

Dev P: Dynamic Accelerator Slicer Operator (fka: InstaSlice)


    • Strategic Portfolio Work
    • OCPSTRAT-1692 AI Workloads for OpenShift
    • 75% To Do, 25% In Progress, 0% Done
    • Program Call

      Feature Overview (aka. Goal Summary)  

      As an OpenShift administrator running AI workloads on the platform, I need efficient GPU utilization, given the high cost of these resources. While NVIDIA GPUs offer a way to pre-slice a GPU for multiple workloads, static slicing can waste resources when the partition layout does not match actual workload demands.

      Therefore, I want to dynamically slice the GPU based on the specific requirements of each workload, ensuring optimal utilization and minimizing resource waste.

      Current high-level goals of InstaSlice:

      • Allocate MIG slices on NVIDIA GPUs on demand, based on pod resource requests/limits
      • Configure allocated slices on GPUs and bind containers to the intended slices
      • Account for other resources on the selected node (CPU, memory) before binding the pod
      • Release slices when a pod completes or is deleted
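
      As a sketch of the on-demand allocation model, a pod could request a slice through an extended resource. The resource name below follows the NVIDIA device plugin's "mixed" MIG naming convention and is an assumption for illustration, not InstaSlice's confirmed API:

      ```yaml
      # Hypothetical pod requesting one MIG slice; the resource name
      # (nvidia.com/mig-1g.5gb) is an assumption based on NVIDIA's
      # device plugin MIG naming, shown for illustration only.
      apiVersion: v1
      kind: Pod
      metadata:
        name: mig-workload
      spec:
        containers:
        - name: trainer
          image: nvcr.io/nvidia/pytorch:24.01-py3
          resources:
            limits:
              nvidia.com/mig-1g.5gb: 1   # one 1g.5gb slice, allocated on demand
      ```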

      Some stretch/future goals:

      • Pods with multiple slices
      • Dynamically enabling and disabling MIG on GPUs
      • Heterogeneous GPU types in a cluster
      • Reliability and robustness by handling GPU and node failures
      • Compatibility/integration with Kueue, cluster autoscaling, and pod priorities and preemption

      Plan [from OCPNODE-2416]

      Phase 1 Goals (for OpenShift 4.18):

      • Allocate MIG slices on NVIDIA GPUs on demand (based on pod resource requests/limits)
      • Configure allocated slices on GPUs and bind containers to intended slices
      • Release and unconfigure slices when pods complete or are deleted
      • Account for node capacity (CPU, memory, …) when selecting a node:
        requested capacity <= total node capacity - capacity already reserved for InstaSlice pods
      • Schedule pods in less than 10s on average (assuming resources are available)
      • Test scalability and stability to document how many nodes and GPUs are supported
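
      The node-capacity condition above can be sketched as follows; the names (`Capacity`, `fits_on_node`) are hypothetical, not InstaSlice's actual API:

      ```python
      from dataclasses import dataclass

      @dataclass
      class Capacity:
          """Non-GPU resources tracked when selecting a node (illustrative)."""
          cpu_millis: int   # CPU in millicores
          mem_bytes: int    # memory in bytes

      def fits_on_node(requested: Capacity, total: Capacity, reserved: Capacity) -> bool:
          """A pod fits if requested <= total node capacity minus the
          capacity already reserved for InstaSlice-managed pods."""
          return (requested.cpu_millis <= total.cpu_millis - reserved.cpu_millis
                  and requested.mem_bytes <= total.mem_bytes - reserved.mem_bytes)
      ```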

      Phase 1 Stretch Goals:

      • Handle Pods requesting multiple slices (from one or multiple containers)
      • Enable and disable MIG on demand
      • Manage heterogeneous clusters with multiple GPU types
      • Mitigate scheduling failures, node failures, and GPU failures
      • Improve scheduling latency

      Potential Phase 2 Goals:

      • Provide compatible cluster autoscaler
      • Implement pod priorities with MIG-aware preemption

       

      • Align with DRA APIs (subject to DRA stabilizing and supporting MIG)
      • Leverage DRA implementation

      Non-Goals:

      • Share MIG slices among multiple containers
      • Achieve scheduling latency below 5s (need help from RH team)

              Gaurav Singh (gausingh@redhat.com)
              Harshal Patil
              Aruna Naik
              Daniel Macpherson
              Mrunal Patel
              Eric Rich

              Votes: 1
              Watchers: 20