Type: Feature
Resolution: Unresolved
Priority: Major
BU Product Work
CNV-24729 - Dynamic Resource Allocation (DRA) for VMs
100% To Do, 0% In Progress, 0% Done
Feature Overview (aka. Goal Summary)
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
DRA structured parameters are designed to improve how specialized hardware resources (like GPUs or network devices) are allocated to pods. Instead of relying on opaque parameters handled by third-party drivers, this system introduces ResourceSlice objects that list available devices on nodes with attributes (e.g., GPU model, VRAM), and ResourceClaim objects where users specify resource requirements using CEL expressions (e.g., device.attributes["vendor"] == "nvidia"). The Kubernetes scheduler directly matches claims to slices, eliminating delays from driver communication and enabling native cluster autoscaling decisions.
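As a sketch of what this looks like in practice, the following ResourceClaim uses a CEL selector to request a single GPU. This assumes the resource.k8s.io/v1beta1 API group from Kubernetes 1.32; the claim name and DeviceClass name are hypothetical:

```yaml
# Hypothetical ResourceClaim: request one device from a GPU DeviceClass,
# narrowed by a CEL expression evaluated against each advertised device.
apiVersion: resource.k8s.io/v1beta1   # beta API group as of Kubernetes 1.32
kind: ResourceClaim
metadata:
  name: single-nvidia-gpu             # hypothetical name
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com   # hypothetical DeviceClass
      selectors:
      - cel:
          # Match devices published by the NVIDIA DRA driver.
          expression: device.driver == "gpu.nvidia.com"
```

Because the selector is a plain CEL expression over published device metadata, the scheduler can evaluate it without calling out to the driver.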
Goals (aka. expected user outcomes)
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Describe devices with a name and a set of associated attributes that can be used to select them.
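Drivers advertise such devices through ResourceSlice objects. A minimal sketch, assuming the resource.k8s.io/v1beta1 schema, with hypothetical node, pool, and attribute names:

```yaml
# Hypothetical ResourceSlice: a driver's advertisement of one GPU on node-a,
# with named attributes that CEL selectors in claims can match against.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: node-a-gpu.nvidia.com-0       # typically generated by the driver
spec:
  nodeName: node-a
  driver: gpu.nvidia.com
  pool:
    name: node-a
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    basic:
      attributes:
        productName:
          string: "A100"              # illustrative attribute
```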
Requirements (aka. Acceptance Criteria):
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For configurations that are out of scope for a given release, ensure you provide the OCPSTRAT reference for the configuration to be supported in the future.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Both |
Classic (standalone cluster) | Yes |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | TBD |
Operator compatibility | TBD |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | TBD |
Other (please specify) |
Use Cases (Optional):
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
- AI/ML Workloads: Allocate GPUs dynamically for model training.
- Hardware Sharing: Split a GPU across multiple pods with fine-grained control.
- Multi-Cloud Portability: Standardized resource requests work across clusters.
Structured parameters mark a shift toward Kubernetes-native resource management, reducing dependency on external drivers for core scheduling decisions.
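To illustrate the first use case, a pod could consume a pre-created claim along these lines. This is a sketch only; the pod name, claim name, and image are hypothetical:

```yaml
# Hypothetical pod consuming an existing ResourceClaim for model training.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  resourceClaims:
  - name: gpu                              # local name for the claim
    resourceClaimName: single-nvidia-gpu   # hypothetical pre-created claim
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # hypothetical image
    resources:
      claims:
      - name: gpu                          # attach the claimed device here
```

For workloads that need a claim per pod (e.g., one GPU per replica), resourceClaimTemplateName can be used instead of resourceClaimName so each pod gets its own claim.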
Questions to Answer (Optional):
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
Out of Scope
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Background
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Kubernetes Dynamic Resource Allocation (DRA) structured parameters are a framework that brings transparency and efficiency to how specialized hardware resources (like GPUs, NICs, or FPGAs) are requested and allocated in Kubernetes clusters. Introduced in Kubernetes 1.30 and promoted to beta in 1.32, they solve a critical limitation of the earlier DRA design by letting Kubernetes understand resource requirements natively.
Key Components
- ResourceSlice: A Kubernetes object that lists available devices (e.g. GPUs) on nodes, including attributes like model or memory
- ResourceClaim: A user’s request for specific resources (e.g., "2 NVIDIA A100 GPUs with 80GB VRAM")
- DeviceClass: Predefined criteria (e.g., GPU type, driver version) that simplify resource requests
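A DeviceClass sketch tying these components together, again assuming resource.k8s.io/v1beta1; the class name is hypothetical:

```yaml
# Hypothetical DeviceClass: pre-selects all devices published by a given
# DRA driver, so ResourceClaims can reference the class by name instead
# of repeating the selector.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.example.com                # hypothetical class name
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"
```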
DRA was introduced in Kubernetes 1.26 as an alpha feature. NVIDIA is targeting the end of Q3 for enabling DRA in their GPU Operator, which would align with OCP 4.14, and their MIG-backed vGPU capability will likely be based on DRA.
On the OpenShift Virtualization side, Fabian Deutsch has created the CNV-24730 epic to track the work on DRA usage for VMs, with MIG-backed vGPU as a focus. We are also in touch with Kevin Klues at NVIDIA, who is a co-author of the KEP and the maintainer of the NVIDIA DRA driver for GPUs.
Having Dynamic Resource Allocation available in OpenShift 4.13, possibly behind a feature gate, would let us experiment with the implementation in OpenShift Virtualization.
References
- KEP-4381: Dynamic Resource Allocation with Structured Parameters
  https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4381-dra-structured-parameters/README.md
- Kubernetes documentation: Dynamic Resource Allocation
- Kubernetes Enhancement Proposal - Dynamic Resource Allocation
Customer Considerations
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Documentation Considerations
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Interoperability Considerations
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Issue links
- is incorporated by: OCPSTRAT-1692 AI Workloads for OpenShift (In Progress)
- is related to: OCPSTRAT-999 Tech Preview DRA in 4.15 (Closed)
- relates to: OCPSTRAT-1756 [Upstream] Attribute-Based GPU Allocation in OpenShift with NVIDIA GPU operator (Refinement)
- links to: