Type: Feature
Resolution: Unresolved
Priority: Critical
Work Category: Strategic Portfolio Work
Parent: OCPSTRAT-1692 - AI Workloads for OpenShift
Progress: 75% To Do, 25% In Progress, 0% Done
Labels: Program Call
Feature Overview (aka. Goal Summary)
As an OpenShift administrator looking to run AI workloads on the platform, I consider efficient GPU utilization crucial given the high cost of these resources. While NVIDIA's Multi-Instance GPU (MIG) technology allows a GPU to be pre-sliced for multiple workloads, static pre-slicing can waste resources when the slice layout does not match actual workload demands.
Therefore, I want GPUs to be sliced dynamically, based on the specific requirements of each workload, to ensure optimal utilization and minimize resource waste.
Current high-level goals of InstaSlice:
- Allocate MIG slices on NVIDIA GPUs on demand, based on pod resource requests/limits (see the example after this list)
- Configure allocated slices on GPUs and bind containers to their intended slices
- Account for other resources on the selected node (CPU, memory) before binding the pod
- Release slices when a pod completes or is deleted
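For illustration, a minimal sketch of a pod requesting one MIG slice through an extended resource. The resource name (nvidia.com/mig-1g.5gb, following NVIDIA's MIG profile naming) and the image are assumptions, not a confirmed InstaSlice API:

```yaml
# Hypothetical workload requesting a single MIG slice on demand.
# The extended resource name is an assumption based on NVIDIA's
# MIG profile naming; InstaSlice's actual contract may differ.
apiVersion: v1
kind: Pod
metadata:
  name: mig-workload
spec:
  restartPolicy: Never
  containers:
  - name: trainer
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9  # placeholder image
    command: ["nvidia-smi", "-L"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1  # one 1g.5gb slice, allocated on demand
```

When the pod completes or is deleted, the slice is released and its capacity becomes available to other workloads (the last goal above).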
Some stretch/future goals:
- Pods with multiple slices
- Dynamically enabling and disabling MIG on GPUs
- Heterogeneous GPU types in a cluster
- Reliability and robustness by handling GPU and node failures
- Compatibility/integration with Kueue, cluster autoscaling, and pod priorities and preemption (a Kueue sketch follows this list)
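On the Kueue integration point, the quota model would presumably treat MIG slices as just another covered resource. A sketch using the upstream Kueue v1beta1 API, assuming slice quota is expressed with the same extended resource name as above; all names and quota values are illustrative:

```yaml
# Sketch of Kueue quota covering MIG slices alongside CPU and memory.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: mig-nodes
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ai-workloads
spec:
  namespaceSelector: {}  # admit workloads from all namespaces
  resourceGroups:
  - coveredResources: ["cpu", "memory", "nvidia.com/mig-1g.5gb"]
    flavors:
    - name: mig-nodes
      resources:
      - name: "cpu"
        nominalQuota: 64
      - name: "memory"
        nominalQuota: 256Gi
      - name: "nvidia.com/mig-1g.5gb"
        nominalQuota: 14
```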
Plan [from OCPNODE-2416]
Phase 1 Goals (for OpenShift 4.18):
- Allocate MIG slices on NVIDIA GPUs on demand (based on pod resource requests/limits)
- Configure allocated slices on GPUs and bind containers to intended slices
- Release and unconfigure slices when pods complete or are deleted
- Account for node capacity (CPU, memory, …) when selecting a node:
  - requested capacity <= total node capacity - capacity already reserved for InstaSlice pods (see the example after this list)
- Schedule pods in less than 10s on average (assuming resources are available)
- Test scalability and stability to document how many nodes and GPUs are supported
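As an illustration of the capacity check, a sketch of a pod whose CPU and memory requests must fit within the node's remaining capacity before InstaSlice binds it; the names, values, and MIG profile are hypothetical:

```yaml
# Hypothetical pod combining a MIG slice with ordinary CPU/memory
# requests. InstaSlice should only bind it to a node where
#   requested capacity <= total node capacity
#                         - capacity already reserved for InstaSlice pods
apiVersion: v1
kind: Pod
metadata:
  name: mig-inference
spec:
  containers:
  - name: server
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9  # placeholder image
    resources:
      requests:
        cpu: "4"
        memory: 16Gi
      limits:
        nvidia.com/mig-2g.10gb: 1  # extended resources: limits imply requests
```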
Phase 1 Stretch Goals:
- Handle pods requesting multiple slices, from one or multiple containers (see the sketch after this list)
- Enable and disable MIG on demand
- Manage heterogeneous clusters with multiple GPU types
- Mitigate scheduling failures, node failures, and GPU failures
- Improve scheduling latency
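A sketch of the multi-slice stretch goal: one pod, two containers, each requesting its own slice. The profile names and images are assumptions:

```yaml
# Hypothetical multi-slice pod: each container gets its own MIG slice.
apiVersion: v1
kind: Pod
metadata:
  name: multi-slice-workload
spec:
  containers:
  - name: preprocess
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9  # placeholder image
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # small slice for preprocessing
  - name: train
    image: nvcr.io/nvidia/cuda:12.4.1-base-ubi9  # placeholder image
    resources:
      limits:
        nvidia.com/mig-3g.20gb: 1  # larger slice for training
```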
Potential Phase 2 Goals:
- Provide compatible cluster autoscaler
- Implement pod priorities with MIG-aware preemption (see the sketch after this list)
- Align with DRA APIs (subject to DRA stabilizing and supporting MIG)
- Leverage DRA implementation
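The pod-priority goal would presumably build on the standard Kubernetes PriorityClass API; what would be new is preemption logic that understands slice placement. A sketch using only upstream fields, with an illustrative class name and value:

```yaml
# Standard Kubernetes PriorityClass; MIG-aware preemption on top of
# this is a potential Phase 2 behavior, not an existing feature.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 100000
preemptionPolicy: PreemptLowerPriority
description: "High-priority AI workloads that may preempt lower-priority MIG consumers."
```

A workload would opt in by setting priorityClassName: gpu-high-priority in its pod spec.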
Non-Goals:
- Share MIG slices among multiple containers
- Achieve scheduling latency below 5s (would need help from the RH team)