Type: Epic
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: CNV Virt-Node
Labels:
- grooming

Epic Name:
design-dynamic-resource-allocation
Activity Type:
Product / Portfolio Work
Blocked:
False
Blocked Reason:

None
Ready:
False
Epic Status:
To Do
Feature Link:
VIRTSTRAT-249 - Dynamic Resource Allocation (DRA) for VMs
Parent Link:
VIRTSTRAT-249 - Dynamic Resource Allocation (DRA) for VMs
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

PX Impact Score:

Market:

Goal

Design how resources provided by Dynamic Resource Allocation can be consumed by KubeVirt VMs.

Description

The Dynamic Resource Allocation (DRA) feature is an alpha API in Kubernetes 1.26, which is the base for OpenShift 4.13.
This feature provides the ability to create ResourceClaim and ResourceClass objects to request access to resources. This is similar to the dynamic provisioning of a PersistentVolume via a PersistentVolumeClaim and a StorageClass.
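
To make the PersistentVolumeClaim analogy concrete, here is a minimal sketch of the three objects involved, built as plain Python dictionaries and printed as JSON manifests. It assumes the alpha API group/version `resource.k8s.io/v1alpha1` and the field names as they shipped in Kubernetes 1.26; later releases changed this API. The driver name `gpu.example.com`, the object names, and the image are hypothetical placeholders, not the NVIDIA driver's actual values.

```python
import json

# Assumption: alpha API group/version as introduced in Kubernetes 1.26.
API_VERSION = "resource.k8s.io/v1alpha1"


def resource_class(name, driver_name):
    """ResourceClass: names the DRA driver, much as a StorageClass names a provisioner."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ResourceClass",
        "metadata": {"name": name},
        "driverName": driver_name,
    }


def resource_claim(name, class_name):
    """ResourceClaim: requests a resource from a class, much as a PVC does."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {"resourceClassName": class_name},
    }


def pod_with_claim(pod_name, claim_name, image):
    """Pod that references the claim and exposes it to one container."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": pod_name},
        "spec": {
            # Pod-level entry: binds a local name to an existing ResourceClaim.
            "resourceClaims": [
                {"name": "gpu", "source": {"resourceClaimName": claim_name}}
            ],
            "containers": [
                {
                    "name": "main",
                    "image": image,
                    # Container-level entry: refers to the pod-level name above.
                    "resources": {"claims": [{"name": "gpu"}]},
                }
            ],
        },
    }


# Hypothetical names, for illustration only.
cls = resource_class("example-gpu-class", "gpu.example.com")
claim = resource_claim("example-gpu-claim", cls["metadata"]["name"])
pod = pod_with_claim("example-pod", claim["metadata"]["name"], "registry.example.com/app:latest")

for manifest in (cls, claim, pod):
    print(json.dumps(manifest, indent=2))
```

The sketch covers only the pod-level wiring. How a KubeVirt VM would express and consume such a claim is the open design question this epic is meant to answer.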

NVIDIA has been a lead contributor to the KEP and already has an initial implementation of a DRA driver and plugin, with a nice demo recording. NVIDIA expects to have this DRA driver available in CY23 Q3 or Q4, so likely in NVIDIA GPU Operator v23.9, around OpenShift 4.14.

When asked about the availability of MIG-backed vGPU for Kubernetes, NVIDIA said that the timeframe has not been decided yet, because it will likely use DRA for creating the MIG devices and registering them with the vGPU host driver. The MIG-backed vGPU feature for OpenShift Virtualization will then likely require DRA support to request vGPU resources for the VMs.

Not having MIG-backed vGPU is a risk for OpenShift Virtualization adoption in GPU use cases, such as virtual workstations for rendering with Windows-only software. Customers who want a mix of passthrough, time-based vGPU and MIG-backed vGPU will prefer competitors who offer the full range of options. The certification of NVIDIA solutions like NVIDIA Omniverse will also be blocked, despite its great potential to increase OpenShift consumption: it uses RTX/A40 GPUs for virtual workstations (not certified by NVIDIA on OpenShift Virtualization yet) and A100/H100 GPUs for physics simulation, with both use cases probably leveraging vGPUs [7]. Many conditions must be met for that certification to happen, and MIG-backed vGPU support is one of them.

User Stories

GPU consumption optimization
"As an Admin, I want to let the NVIDIA GPU DRA driver provision vGPUs for OpenShift Virtualization, so that it optimizes the allocation with dynamic provisioning of time-based or MIG-backed vGPUs."
GPU mixed types per server
"As an Admin, I want to be able to mix different types of GPUs to co-locate different types of workloads on the same host, in order to improve multi-pod/stack performance."

Non-Requirements

List of things not included in this epic, to alleviate any doubt raised during the grooming process.

Notes

Any additional details or decisions made/needed

References

Done Checklist

Who	What	Reference
DEV	Upstream roadmap issue (or individual upstream PRs)	<link to GitHub Issue>
DEV	Upstream documentation merged	<link to meaningful PR>
DEV	gap doc updated	<name sheet and cell>
DEV	Upgrade consideration	<link to upgrade-related test or design doc>
DEV	CEE/PX summary presentation	label epic with cee-training and add a <link to your support-facing preso>
QE	Test plans in Polarion	<link or reference to Polarion>
QE	Automated tests merged	<link or reference to automated tests>
DOC	Downstream documentation merged	<link to meaningful PR>

is related to

CNV-31976 Allow using pci_bus_id to specify a PCI device for passthrough

links to

openshift/kubernetes#2103: OCPNODE-2627: enable upstream dra tests

Assignee: Stuart Gott

Reporter: Fabian Deutsch

Contributors: Eran Ifrach

QA Contact: Denys Shchedrivyi

Votes: 0

Watchers: 11

Created: 2023/01/25 8:46 AM

Updated: 2025/08/06 8:44 PM
