Come up with a design of how resources provided by Dynamic Resource Allocation can be consumed by KubeVirt VMs.
The Dynamic Resource Allocation (DRA) feature is an alpha API in Kubernetes 1.26, which is the base for OpenShift 4.13.
This feature provides the ability to create ResourceClaim and ResourceClass objects to request access to resources. This is similar to the dynamic provisioning of a PersistentVolume via PersistentVolumeClaim and StorageClass.
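To make the analogy with storage concrete, here is a minimal sketch of the three objects involved when a plain pod requests a device through DRA. It assumes the `resource.k8s.io/v1alpha1` API shape from Kubernetes 1.26 (later releases changed this API), and every name in it (`gpu.example.com`, `vgpu-class`, `vgpu-claim`, the image) is hypothetical rather than taken from the NVIDIA driver. How a KubeVirt VM would express the same request is exactly what this epic has to design, so it is not shown.

```python
import json

# Hypothetical driver name; a real DRA driver registers its own.
DRIVER_NAME = "gpu.example.com"


def resource_class(name, driver_name):
    """Cluster-scoped class naming the DRA driver (analogous to a StorageClass)."""
    return {
        "apiVersion": "resource.k8s.io/v1alpha1",  # assumed: Kubernetes 1.26 alpha API
        "kind": "ResourceClass",
        "metadata": {"name": name},
        "driverName": driver_name,
    }


def resource_claim(name, namespace, class_name):
    """Namespaced request for a resource (analogous to a PersistentVolumeClaim)."""
    return {
        "apiVersion": "resource.k8s.io/v1alpha1",
        "kind": "ResourceClaim",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"resourceClassName": class_name},
    }


def pod_using_claim(name, namespace, claim_name, image):
    """Pod that references the claim and hands it to its single container."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Pod-level entry binds a local name to the existing claim.
            "resourceClaims": [
                {"name": "gpu", "source": {"resourceClaimName": claim_name}}
            ],
            "containers": [
                {
                    "name": "workload",
                    "image": image,
                    # Container-level entry refers to the pod-level name.
                    "resources": {"claims": [{"name": "gpu"}]},
                }
            ],
        },
    }


def build_manifests():
    """Return the class, claim, and pod wired together with hypothetical names."""
    cls = resource_class("vgpu-class", DRIVER_NAME)
    claim = resource_claim("vgpu-claim", "demo", cls["metadata"]["name"])
    pod = pod_using_claim(
        "demo-pod", "demo", claim["metadata"]["name"], "registry.example.com/workload:latest"
    )
    return [cls, claim, pod]


if __name__ == "__main__":
    # JSON is valid YAML, so the output could be fed to a cluster as-is.
    for manifest in build_manifests():
        print(json.dumps(manifest, indent=2))
```

The chain of references mirrors storage: the pod names a claim, the claim names a class, and the class names the driver that performs the allocation.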
NVIDIA has been a lead contributor to the KEP and already has an initial implementation of a DRA driver and plugin, along with a demo recording. NVIDIA expects to make this DRA driver available in CY23 Q3 or Q4, so likely in NVIDIA GPU Operator v23.9, around OpenShift 4.14.
When asked about the availability of MIG-backed vGPU for Kubernetes, NVIDIA said that the timeframe is not decided yet, because it will likely use DRA to create the MIG devices and register them with the vGPU host driver. The MIG-backed vGPU feature for OpenShift Virtualization will then likely require DRA support to request vGPU resources for the VMs.
Not having MIG-backed vGPU is a risk for OpenShift Virtualization adoption in GPU use cases, such as virtual workstations for rendering with Windows-only software. Customers who want a mix of passthrough, time-sliced vGPU and MIG-backed vGPU will prefer competitors who offer the full range of options. The certification of NVIDIA solutions like NVIDIA Omniverse will also be blocked, despite its great potential to increase OpenShift consumption: it uses RTX/A40 GPUs for virtual workstations (not yet certified by NVIDIA on OpenShift Virtualization) and A100/H100 GPUs for physics simulation, and both use cases probably leverage vGPUs. Many conditions are necessary for that certification to happen, and MIG-backed vGPU support is one of them.
- GPU consumption optimization
"As an Admin, I want to let the NVIDIA GPU DRA driver provision vGPUs for OpenShift Virtualization, so that it optimizes the allocation with dynamic provisioning of time-sliced or MIG-backed vGPUs."
- GPU mixed types per server
"As an Admin, I want to be able to mix different types of GPU to colocate different types of workloads on the same host, in order to improve multi-pod/stack performance."
- List of things not included in this epic, to alleviate any doubt raised during the grooming process.
- Any additional details or decisions made/needed
- Kubernetes > Dynamic Resource Allocation
- Kubernetes Enhancement Proposal - Dynamic Resource Allocation
- NVIDIA GPU DRA driver - Repository
- NVIDIA GPU DRA driver - demo recording
- NVIDIA Omniverse
- NVIDIA GPUs for Virtualization
|Upstream roadmap issue (or individual upstream PRs)|<link to GitHub Issue>|
|Upstream documentation merged|<link to meaningful PR>|
|gap doc updated|<name sheet and cell>|
| |<link to upgrade-related test or design doc>|
|CEE/PX summary presentation|label epic with cee-training and add a <link to your support-facing preso>|
|Test plans in Polarion|<link or reference to Polarion>|
|Automated tests merged|<link or reference to automated tests>|
|Downstream documentation merged|<link to meaningful PR>|