Type: Feature
Resolution: Unresolved
Priority: Critical
Work Category: Strategic Portfolio Work
Parent: OCPSTRAT-1692 - AI Workloads for OpenShift
Progress: 75% To Do, 25% In Progress, 0% Done
Labels: Program Call
Feature Overview (aka. Goal Summary)
As an OpenShift administrator looking to run AI workloads on the platform, I consider efficient GPU utilization crucial given the high cost of these resources. While NVIDIA's Multi-Instance GPU (MIG) technology allows a GPU to be pre-sliced for multiple workloads, static pre-slicing can waste resources when the slice layout does not match actual workload demands.
Therefore, I want GPUs to be sliced dynamically, based on the specific requirements of each workload, to ensure optimal utilization and minimize resource waste.
Current high-level goals of InstaSlice:
- Allocate MIG slices on NVIDIA GPUs on demand, based on pod resource requests/limits (see the example after this list)
- Configure allocated slices on GPUs and bind containers to their intended slices
- Account for other resources on the selected node (CPU, memory) before binding the pod
- Release slices when a pod completes or is deleted
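For illustration, a minimal sketch of a pod requesting one MIG slice through an extended resource. The resource name (nvidia.com/mig-1g.5gb, following NVIDIA's MIG profile naming) and the image are assumptions, not a confirmed InstaSlice API:

```yaml
# Hypothetical workload requesting a single MIG slice on demand.
# The extended resource name is an assumption based on NVIDIA's
# MIG profile naming; InstaSlice's actual contract may differ.
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9  # placeholder image
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # one 1g.5gb slice, allocated on demand
```

When the pod completes or is deleted, the slice is released and its capacity becomes available to other workloads (the last goal above).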
Some stretch/future goals:
- Pods with multiple slices
- Dynamically enabling and disabling MIG on GPUs
- Heterogeneous GPU types in a cluster
- Reliability and robustness by handling GPU and node failures
- Compatibility/integration with Kueue, cluster autoscaling, and pod priorities and preemption (a Kueue sketch follows this list)
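On the Kueue integration point, the quota model would presumably treat MIG slices as just another covered resource. A sketch using the upstream Kueue v1beta1 API, assuming slice quota is expressed with the same extended resource name as above; all names and quota values are illustrative:

```yaml
# Sketch of Kueue quota covering MIG slices alongside CPU and memory.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: mig-nodes
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ai-workloads
spec:
  namespaceSelector: {}  # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/mig-1g.5gb"]
    flavors:
    - name: mig-nodes
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 256Gi
      - name: "nvidia.com/mig-1g.5gb"
        nominalQuota: 14
```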
Plan [from OCPNODE-2416]
Phase 1 Goals (for OpenShift 4.18):
- Allocate MIG slices on NVIDIA GPUs on demand (based on pod resource requests/limits)
- Configure allocated slices on GPUs and bind containers to intended slices
- Release and unconfigure slices when pods complete or are deleted
- Account for node capacity (CPU, memory, …) when selecting a node:
  - requested capacity <= total node capacity - capacity already reserved for InstaSlice pods (see the example after this list)
- Schedule pods in less than 10s on average (assuming resources are available)
- Test scalability and stability to document how many nodes and GPUs are supported
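As an illustration of the capacity check, a sketch of a pod whose CPU and memory requests must fit within the node's remaining capacity before InstaSlice binds it; the names, values, and MIG profile are hypothetical:

```yaml
# Hypothetical pod combining a MIG slice with ordinary CPU/memory
# requests. InstaSlice should only bind it to a node where
#   requested capacity <= total node capacity
#                         - capacity already reserved for InstaSlice pods
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference
spec:
  containers:
  - name: server
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9  # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
      limits:
        nvidia.com/mig-2g.10gb: 1  # extended resources: limits imply requests
```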
Phase 1 Stretch Goals:
- Handle pods requesting multiple slices, from one or multiple containers (see the sketch after this list)
- Enable and disable MIG on demand
- Manage heterogeneous clusters with multiple GPU types
- Mitigate scheduling failures, node failures, and GPU failures
- Improve scheduling latency
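A sketch of the multi-slice stretch goal: one pod, two containers, each requesting its own slice. The profile names and images are assumptions:

```yaml
# Hypothetical multi-slice pod: each container gets its own MIG slice.
apiVersion: v1
kind: Pod
metadata:
  name: multi-slice-workload
spec:
  containers:
  - name: preprocess
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9  # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # small slice for preprocessing
  - name: train
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9  # placeholder image
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1  # larger slice for training
```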
Potential Phase 2 Goals:
- Provide compatible cluster autoscaler
- Implement pod priorities with MIG-aware preemption (see the sketch after this list)
- Align with DRA APIs (subject to DRA stabilizing and supporting MIG)
- Leverage DRA implementation
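The pod-priority goal would presumably build on the standard Kubernetes PriorityClass API; what would be new is preemption logic that understands slice placement. A sketch using only upstream fields, with an illustrative class name and value:

```yaml
# Standard Kubernetes PriorityClass; MIG-aware preemption on top of
# this is a potential Phase 2 behavior, not an existing feature.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 100000
preemptionPolicy: PreemptLowerPriority
description: "High-priority AI workloads that may preempt lower-priority MIG consumers."
```

A workload would opt in by setting priorityClassName: gpu-high-priority in its pod spec.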
Non-Goals:
- Share MIG slices among multiple containers
- Achieve scheduling latency below 5s (would need help from the RH team)