Uploaded image for project: 'OpenShift Virtualization'
  1. OpenShift Virtualization
  2. CNV-24730

Design: Dynamic Resource Allocation (k8s feature) for VMs

XMLWordPrintable

    • Icon: Epic Epic
    • Resolution: Unresolved
    • Icon: Major Major
    • None
    • None
    • CNV Virtualization
    • design-dynamic-resource-allocation
    • False
    • Hide

      None

      Show
      None
    • False
    • To Do
    • CNV-24729 - Dynamic Resource Allocation (DRA) for VMs
    • CNV-24729Dynamic Resource Allocation (DRA) for VMs
    • 50% To Do, 0% In Progress, 50% Done
    • ---
    • ---

      Goal

      Come up with a design of how resources provided by Dynamic Resource Allocation can be consumed by KubeVirt VMs.

      Description

      The Dynamic Resource Allocation (DRA) feature is an alpha API in Kubernetes 1.26, which is the base for OpenShift 4.13.
      This feature provides the ability to create ResourceClaim and ResourceClasse to request access to Resources. This is similar to the dynamic provisioning of PersistentVolume via PersistentVolumeClaim and StorageClasse.

      NVIDIA has been a lead contributor to the KEP and has already an initial implementation of a DRA driver and plugin, with a nice demo recording. NVIDIA is expecting to have this DRA driver available in CY23 Q3 or Q4, so likely in NVIDIA GPU Operator v23.9, around OpenShift 4.14.

      When asked about the availability of MIG-backed vGPU for Kubernetes, NVIDIA said that the timeframe is not decided yet, because it will likely use DRA for the MIG devices creation and their registration with the vGPU host driver. The MIG-base vGPU feature for OpenShift Virtualization will then likely require support of DRA to request vGPU resources for the VMs.

      Not having MIG-backed vGPU is a risk for OpenShift Virtualization adoption in GPU use cases, such as virtual workstations for rendering with Windows-only softwares. Customers who want to have a mix of passthrough, time-based vGPU and MIG-backed vGPU will prefer competitors who offer the full range of options. And the certification of NVIDIA solutions like NVIDIA Omniverse will be blocked, despite a great potential to increase the OpenShift consumption, as it uses RTX/A40 GPU for virtual workstations (not certified by NVIDIA on OpenShift Virtualization yet) and A100/H100 for physics simulation, both use cases probably leveraring vGPUs [7]. There's a lot of necessary conditions for that to happen and MIG-backed vGPU support is one of them.

      User Stories

      • GPU consumption optimization
        "As an Admin, I want to let NVIDIA GPU DRA driver provision vGPUs for OpenShift Virtualization, so that it optimizes the allocation with dynamic provisioning of time or MIG backed vGPUs"
      • GPU mixed types per server
        "As an Admin, I want to be able to mix different types of GPU to collocate different types of workloads on the same host, in order to improve multi-pod/stack performance.

      Non-Requirements

      • List of things not included in this epic, to alleviate any doubt raised during the grooming process.

      Notes

      • Any additional details or decisions made/needed

      References

      Done Checklist

      Who What Reference
      DEV Upstream roadmap issue (or individual upstream PRs) <link to GitHub Issue>
      DEV Upstream documentation merged <link to meaningful PR>
      DEV gap doc updated <name sheet and cell>
      DEV Upgrade consideration <link to upgrade-related test or design doc>
      DEV CEE/PX summary presentation label epic with cee-training and add a <link to your support-facingĀ preso>
      QE Test plans in Polarion <link or reference to Polarion>
      QE Automated tests merged <link or reference to automated tests>
      DOC Downstream documentation merged <link to meaningful PR>

              sgott@redhat.com Stuart Gott
              fdeutsch@redhat.com Fabian Deutsch
              Eran Ifrach
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

                Created:
                Updated: