Type: Feature
Resolution: Unresolved
Priority: Major
BU Product Work
CNV-24729 - Dynamic Resource Allocation (DRA) for VMs
100% To Do, 0% In Progress, 0% Done
Feature Overview (aka. Goal Summary)
An elevator pitch (value statement) that describes the Feature in a clear, concise way. Complete during New status.
DRA structured parameters are designed to improve how specialized hardware resources (like GPUs or network devices) are allocated to pods. Instead of relying on opaque parameters handled by third-party drivers, this system introduces ResourceSlice objects that list available devices on nodes with attributes (e.g., GPU model, VRAM), and ResourceClaim objects where users specify resource requirements using CEL expressions (e.g., device.attributes["vendor"] == "nvidia"). The Kubernetes scheduler directly matches claims to slices, eliminating delays from driver communication and enabling native cluster autoscaling decisions.
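As a sketch of what this looks like in practice, the following ResourceClaim uses a CEL selector to request a single GPU. This assumes the resource.k8s.io/v1beta1 API group from Kubernetes 1.32; the claim name and DeviceClass name are hypothetical:

```yaml
# Hypothetical ResourceClaim: request one device from a GPU DeviceClass,
# narrowed by a CEL expression evaluated against each advertised device.
apiVersion: resource.k8s.io/v1beta1   # beta API group as of Kubernetes 1.32
kind: ResourceClaim
metadata:
  name: single-nvidia-gpu             # hypothetical name
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu.example.com   # hypothetical DeviceClass
      selectors:
      - cel:
          # Match devices published by the NVIDIA DRA driver.
          expression: device.driver == "gpu.nvidia.com"
```

Because the selector is a plain CEL expression over published device metadata, the scheduler can evaluate it without calling out to the driver.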
Goals (aka. expected user outcomes)
The observable functionality that the user now has as a result of receiving this feature. Include the anticipated primary user type/persona and which existing features, if any, will be expanded. Complete during New status.
Describe devices with a name and a set of associated attributes that can be used to select them.
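Drivers advertise such devices through ResourceSlice objects. A minimal sketch, assuming the resource.k8s.io/v1beta1 schema, with hypothetical node, pool, and attribute names:

```yaml
# Hypothetical ResourceSlice: a driver's advertisement of one GPU on node-a,
# with named attributes that CEL selectors in claims can match against.
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  name: node-a-gpu.nvidia.com-0       # typically generated by the driver
spec:
  nodeName: node-a
  driver: gpu.nvidia.com
  pool:
    name: node-a
    generation: 1
    resourceSliceCount: 1
  devices:
  - name: gpu-0
    basic:
      attributes:
        productName:
          string: "A100"              # illustrative attribute
```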
Requirements (aka. Acceptance Criteria):
A list of specific needs or objectives that a feature must deliver in order to be considered complete. Be sure to include nonfunctional requirements such as security, reliability, performance, maintainability, scalability, usability, etc. Initial completion during Refinement status.
<enter general Feature acceptance here>
Anyone reviewing this Feature needs to know which deployment configurations the Feature will apply to (or not) once it's been completed. Describe specific needs (or indicate N/A) for each of the following deployment scenarios. For configurations that are out of scope for a given release, ensure you provide the OCPSTRAT reference for the configuration to be supported in the future.
Deployment considerations | List applicable specific needs (N/A = not applicable) |
Self-managed, managed, or both | Both |
Classic (standalone cluster) | Yes |
Hosted control planes | Yes |
Multi node, Compact (three node), or Single node (SNO), or all | All |
Connected / Restricted Network | All |
Architectures, e.g. x86_x64, ARM (aarch64), IBM Power (ppc64le), and IBM Z (s390x) | TBD |
Operator compatibility | TBD |
Backport needed (list applicable versions) | No |
UI need (e.g. OpenShift Console, dynamic plugin, OCM) | TBD |
Other (please specify) |
Use Cases (Optional):
Include use case diagrams, main success scenarios, alternative flow scenarios. Initial completion during Refinement status.
- AI/ML Workloads: Allocate GPUs dynamically for model training.
- Hardware Sharing: Split a GPU across multiple pods with fine-grained control.
- Multi-Cloud Portability: Standardized resource requests work across clusters.
Structured parameters mark a shift toward Kubernetes-native resource management, reducing dependency on external drivers for core scheduling decisions.
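To illustrate the first use case, a pod could consume a pre-created claim along these lines. This is a sketch only; the pod name, claim name, and image are hypothetical:

```yaml
# Hypothetical pod consuming an existing ResourceClaim for model training.
apiVersion: v1
kind: Pod
metadata:
  name: training-job
spec:
  resourceClaims:
  - name: gpu                              # local name for the claim
    resourceClaimName: single-nvidia-gpu   # hypothetical pre-created claim
  containers:
  - name: trainer
    image: registry.example.com/trainer:latest   # hypothetical image
    resources:
      claims:
      - name: gpu                          # attach the claimed device here
```

For workloads that need a claim per pod (e.g., one GPU per replica), resourceClaimTemplateName can be used instead of resourceClaimName so each pod gets its own claim.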
Questions to Answer (Optional):
Include a list of refinement / architectural questions that may need to be answered before coding can begin. Initial completion during Refinement status.
<your text here>
Out of Scope
High-level list of items that are out of scope. Initial completion during Refinement status.
<your text here>
Background
Provide any additional context needed to frame the feature. Initial completion during Refinement status.
Kubernetes Dynamic Resource Allocation (DRA) structured parameters are a framework that brings transparency and efficiency to how specialized hardware resources (like GPUs, NICs, or FPGAs) are requested and allocated in Kubernetes clusters. Introduced in Kubernetes 1.30 and promoted to beta in 1.32, they solve a critical limitation of the earlier DRA design by letting Kubernetes understand resource requirements natively.
Key Components
- ResourceSlice: A Kubernetes object that lists available devices (e.g. GPUs) on nodes, including attributes like model or memory
- ResourceClaim: A user’s request for specific resources (e.g., "2 NVIDIA A100 GPUs with 80GB VRAM")
- DeviceClass: Predefined criteria (e.g., GPU type, driver version) that simplify resource requests
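A DeviceClass sketch tying these components together, again assuming resource.k8s.io/v1beta1; the class name is hypothetical:

```yaml
# Hypothetical DeviceClass: pre-selects all devices published by a given
# DRA driver, so ResourceClaims can reference the class by name instead
# of repeating the selector.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu.example.com                # hypothetical class name
spec:
  selectors:
  - cel:
      expression: device.driver == "gpu.nvidia.com"
```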
DRA was introduced in Kubernetes 1.26 as an alpha feature. NVIDIA is targeting the end of Q3 for enabling DRA in their GPU Operator, which would align with OCP 4.14, and their MIG-backed vGPU capability will likely be based on DRA.
On the OpenShift Virtualization side, Fabian Deutsch has created the CNV-24730 epic to track the work on DRA usage for VMs, with MIG-backed vGPU as a focus. We are also in touch with Kevin Klues at NVIDIA, who is a co-author of the KEP and the maintainer of the NVIDIA DRA driver for GPUs.
Having Dynamic Resource Allocation available in OpenShift 4.13, possibly behind a feature gate, would let us experiment with the implementation in OpenShift Virtualization.
References
- KEP-4381: Dynamic Resource Allocation with Structured Parameters
  https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/4381-dra-structured-parameters/README.md
- Kubernetes documentation: Dynamic Resource Allocation
- Kubernetes Enhancement Proposal - Dynamic Resource Allocation
Customer Considerations
Provide any additional customer-specific considerations that must be made when designing and delivering the Feature. Initial completion during Refinement status.
<your text here>
Documentation Considerations
Provide information that needs to be considered and planned so that documentation will meet customer needs. If the feature extends existing functionality, provide a link to its current documentation. Initial completion during Refinement status.
<your text here>
Interoperability Considerations
Which other projects, including ROSA/OSD/ARO, and versions in our portfolio does this feature impact? What interoperability test scenarios should be factored by the layered products? Initial completion during Refinement status.
<your text here>
Issue links
- is incorporated by: OCPSTRAT-1692 AI Workloads for OpenShift (In Progress)
- is related to: OCPSTRAT-999 Tech Preview DRA in 4.15 (Closed)
- relates to: OCPSTRAT-1756 [Upstream] Attribute-Based GPU Allocation in OpenShift with NVIDIA GPU operator (Refinement)
- links to: