Type: Epic
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:

Epic Name:
DRA: Handle extended resources
Activity Type:
Product / Portfolio Work
Parent Link:
OCPSTRAT-2382 DRA: Handle extended resource requests via DRA Driver (upstream work in 1.36)
Hierarchy Progress Bar:

27% To Do, 0% In Progress, 73% Done
Blocked:
False
Blocked Reason:

None
Ready:
False
Color Status:
Not Selected
Size:
M

Target Version:
None
Release Blocker:
None


Epic Goal

The goal of this epic is to ensure that the extended resources KEP (KEP-5004, DRAExtendedResource) is enabled in OpenShift, and to track the upstream work to promote this KEP to beta in Kubernetes 1.35.

Why is this important?

This KEP allows a seamless transition between the older, simpler resource request method (device plugins) and the more advanced features of Dynamic Resource Allocation (DRA), while keeping the two compatible. It prevents a split ecosystem and simplifies the adoption of DRA for application developers.

According to the motivation of the upstream KEP, cluster administrators must be able to transition to DRA gradually and at their own pace, possibly one node at a time. This means supporting clusters in which, for the same hardware, some nodes use device plugins while other nodes use DRA drivers.

Scenarios

The Challenge:

Imagine you are running a multi-tenant Kubernetes cluster used by various teams for Machine Learning (ML) workloads. These workloads rely heavily on specialized NVIDIA GPUs for training models.

Historically, your cluster used the standard Kubernetes Device Plugin mechanism to expose the GPUs as an Extended Resource, specifically:

Resource Name: nvidia.com/gpu

Application Request: Pods request the resource like any other extended resource (e.g., requests: { nvidia.com/gpu: 1 }).
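
As a rough illustration of the request shape described above, the following Python sketch builds such a Pod manifest as a plain dictionary and reads the extended resource requests back out of it. The pod name, container name, and image are hypothetical placeholders, and the "name contains a slash" check is only a simplifying heuristic for spotting domain-prefixed resources. The limit is set equal to the request because extended resources cannot be overcommitted.

```python
import json

GPU_RESOURCE = "nvidia.com/gpu"


def gpu_pod_manifest(name, image, gpus=1):
    """Build a Pod manifest that requests GPUs as an extended resource."""
    quantity = str(gpus)  # extended resources are requested in whole units
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "trainer",  # hypothetical container name
                    "image": image,
                    "resources": {
                        # Request and limit are kept equal on purpose.
                        "requests": {GPU_RESOURCE: quantity},
                        "limits": {GPU_RESOURCE: quantity},
                    },
                }
            ],
        },
    }


def extended_resource_requests(pod):
    """Sum domain-prefixed resource requests across all containers.

    Heuristic (assumption): a resource name containing "/" is treated
    as an extended resource. This is a simplification for illustration.
    """
    totals = {}
    for container in pod["spec"]["containers"]:
        requests = container.get("resources", {}).get("requests", {})
        for resource, quantity in requests.items():
            if "/" in resource:
                totals[resource] = totals.get(resource, 0) + int(quantity)
    return totals


if __name__ == "__main__":
    # Hypothetical pod name and image, for illustration only.
    pod = gpu_pod_manifest("ml-training", "registry.example.com/ml/train:latest")
    print(json.dumps(pod, indent=2))
    print(extended_resource_requests(pod))
```

The point of the sketch is that the manifest carries no DRA-specific fields: the application only names the resource and a count.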

This setup is simple but has limitations:

It doesn't support sharing fine-grained portions of a device (like allocating specific amounts of GPU memory or compute slices).

It doesn't allow expression-based filtering for specific attributes (e.g., "I need a GPU with at least 16GB of VRAM and a specific driver version").

The complex logic for device-specific allocation and initialization is handled outside the core Kubernetes scheduler.

The Solution: Adopting DRA

To enable sophisticated features like GPU memory sharing and attribute-based device selection, you decide to implement a Dynamic Resource Allocation (DRA) driver.

The problem is that your application developers have thousands of existing Pod manifests that all use the old, simple extended resource request: nvidia.com/gpu: 1.
This KEP mitigates the problem by letting those Pods keep the same request syntax while the request is satisfied through DRA, so the existing manifests gain access to DRA functionality without being rewritten.
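
The mapping between the legacy resource name and DRA can be sketched as follows. This is a minimal Python illustration, not the scheduler's actual logic. It assumes, based on our reading of KEP-5004, that a DeviceClass advertises the extended resource it serves through a `spec.extendedResourceName` field and that the API group version is `resource.k8s.io/v1`; both should be verified against the Kubernetes version in use. The class name, the driver name `gpu.example.com`, and the helper functions are hypothetical.

```python
def device_class_manifest(name, driver, extended_resource_name):
    """Build a DeviceClass that serves a legacy extended resource name.

    Assumption: the field name `extendedResourceName` and the API version
    follow KEP-5004 as we understand it; verify against your cluster.
    """
    return {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "DeviceClass",
        "metadata": {"name": name},
        "spec": {
            # CEL selector restricting the class to devices of one driver.
            "selectors": [
                {"cel": {"expression": f'device.driver == "{driver}"'}}
            ],
            "extendedResourceName": extended_resource_name,
        },
    }


def match_device_classes(requests, device_classes):
    """Toy lookup: map each requested resource name to a DeviceClass name.

    Resources that no DeviceClass advertises are left out of the result;
    in a mixed cluster those would still be served by a device plugin.
    """
    by_resource = {
        dc["spec"].get("extendedResourceName"): dc["metadata"]["name"]
        for dc in device_classes
    }
    return {res: by_resource[res] for res in requests if res in by_resource}


if __name__ == "__main__":
    # Hypothetical class and driver names, for illustration only.
    gpu_class = device_class_manifest(
        "gpu-class", "gpu.example.com", "nvidia.com/gpu"
    )
    legacy_requests = {"nvidia.com/gpu": 1, "example.com/fpga": 1}
    print(match_device_classes(legacy_requests, [gpu_class]))
```

Under these assumptions, the administrator creates the DeviceClass once, and the existing Pod manifests stay untouched.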

Acceptance Criteria

CI - MUST be running successfully with tests automated
Release Technical Enablement - Provide necessary release enablement details and documents.
...

Dependencies (internal and external)

The upstream feature needs to graduate to beta and then to GA.
GPU vendors need to update their respective DRA drivers to support the upstream DRAExtendedResource feature.

Previous Work (Optional):

DRA docs: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/
Extended Resource KEP Issue: https://github.com/kubernetes/enhancements/issues/5004
Enhancement Proposal: https://github.com/kubernetes/enhancements/pull/5136
Alpha Work: https://github.com/kubernetes/kubernetes/pull/130653
Alpha Docs: https://github.com/kubernetes/website/pull/51710

Open questions:

Done Checklist

CI - CI is running, tests are automated and merged.
Release Enablement <link to Feature Enablement Presentation>
DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
DEV - Downstream build attached to advisory: <link to errata>
QE - Test plans in Polarion: <link or reference to Polarion>
QE - Automated tests merged: <link or reference to automated tests>
DOC - Downstream documentation merged: <link to meaningful PR>

is cloned by

OCPNODE-3886 DRA: Handle extended resource requests via DRA Driver (upstream work for 4.22)

links to

KEP-5004: DRAExtendedResource metrics #134523

Upstream PR

Assignee: Sai Ramesh Vanka

Reporter: Sai Ramesh Vanka

Need Info From: None

Contributors: Ayato Tokubi, Sai Ramesh Vanka

QA Contact: Aditi Sahay

Doc Contact: None

Votes: 0

Watchers: 3

Created: 2025/09/11 12:22 PM

Updated: 2025/11/21 4:50 PM
