Type: Feature
Resolution: Done-Errata
Priority: Critical
Fix Version/s: openshift-4.20
Affects Version/s: None
Component/s: ai-ml-workloads, Node
Labels:

Activity Type:
Product / Portfolio Work
Parent Link:
OCPSTRAT-1692 AI Workloads for OpenShift
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Blocked:
False
Blocked Reason:
None
Ready:
False
Size:
None

Target Version:

openshift-4.20
Release Blocker:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Priority Data:
None
PX Impact Score:
PX Technical Impact:
None
PX Impact Range:
None
PX Scheduling Request:
None
PX Technical Impact Notes:
None

Intelligence Requested:
Market:

Feature Overview (aka. Goal Summary)

With the NVIDIA Kubernetes DRA driver integrated into OpenShift, GPU devices are advertised with detailed attributes, allowing Pods to request GPUs based on specific device characteristics. This attribute-based resource allocation enables fine-tuned control over which GPUs are assigned to workloads, ensuring that each Pod receives the exact GPU specifications it requires.

Use case: selecting a GPU based on its attributes

Key Attributes Exposed by the NVIDIA DRA Driver
The DRA driver advertises several GPU device attributes that OpenShift can utilize for precise GPU selection, including:

Product Name: Specifies the exact GPU model (e.g., NVIDIA A100, V100, or Tesla T4). Pods can request specific models based on performance requirements or compatibility with applications, ensuring that workloads leverage the best-suited hardware for their tasks.

GPU Memory Capacity: Allows Pods to request GPUs with a minimum or maximum memory capacity (e.g., 8 GB, 16 GB, 40 GB), essential for memory-intensive workloads like large model training or data processing. This attribute enables applications to allocate GPUs that meet memory needs without overcommitting or underutilizing resources.

Compute Capability: Specifies the CUDA compute capability of the GPU, which determines the hardware features and CUDA versions it supports. Pods can target GPUs based on compute capability to ensure compatibility with the application’s framework (e.g., TensorFlow, PyTorch) and leverage optimized processing capabilities.

Power and Thermal Profiles: Pods can request GPUs based on power usage or thermal characteristics, enabling power-sensitive or temperature-sensitive applications to operate efficiently. This is particularly useful in high-density environments where energy or cooling constraints are factors.

Device ID and Vendor ID: Identifies the GPU's hardware specifics, which allows applications that require specific vendors or device types to make targeted requests.

Driver Version: Pods can request GPUs running specific driver versions, ensuring compatibility with application dependencies and maximizing GPU feature access.
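As an illustration of attribute-based selection, the sketch below builds a ResourceClaimTemplate whose CEL selector combines two of the attributes listed above (product name and memory capacity) and prints it as JSON, which `oc apply -f -` accepts. Everything specific here is an assumption rather than something stated in this issue: the `resource.k8s.io/v1beta1` API version, the `gpu.nvidia.com` DeviceClass and attribute domain, and the `productName` and `memory` keys. Check the ResourceSlices published on the cluster for the names the installed driver actually advertises.

```python
import json

# Assumptions (not taken from this issue): API version, DeviceClass name,
# attribute domain, and the attribute/capacity keys used in the selector.
API_VERSION = "resource.k8s.io/v1beta1"
DEVICE_CLASS = "gpu.nvidia.com"
DOMAIN = "gpu.nvidia.com"


def gpu_selector(product_substring, min_memory):
    """Build a CEL expression matching GPUs by product name and minimum memory."""
    name_check = (
        f"device.attributes['{DOMAIN}'].productName"
        f".lowerAscii().contains('{product_substring.lower()}')"
    )
    memory_check = (
        f"device.capacity['{DOMAIN}'].memory"
        f".compareTo(quantity('{min_memory}')) >= 0"
    )
    return f"{name_check} && {memory_check}"


def claim_template(name, expression):
    """Wrap a CEL expression in a ResourceClaimTemplate manifest (as a dict)."""
    return {
        "apiVersion": API_VERSION,
        "kind": "ResourceClaimTemplate",
        "metadata": {"name": name},
        "spec": {
            "spec": {
                "devices": {
                    "requests": [
                        {
                            "name": "gpu",
                            "deviceClassName": DEVICE_CLASS,
                            "selectors": [{"cel": {"expression": expression}}],
                        }
                    ]
                }
            }
        },
    }


# Hypothetical example: any A100-class GPU with at least 40Gi of memory.
manifest = claim_template("a100-40gi", gpu_selector("A100", "40Gi"))
print(json.dumps(manifest, indent=2))
```

A Pod would then reference the template by name under `spec.resourceClaims`, and the scheduler would only place it on a node with a device that satisfies the expression. Other attributes from the list above could be added to the selector in the same way, provided the driver publishes them.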

https://github.com/NVIDIA/k8s-dra-driver

In 4.19:

Documentation
Make sure E2E is enabled
Engage in upstream activities

is duplicated by

OCPSTRAT-408 Deprecated : Structured parameter in DRA : refer to https://issues.redhat.com/browse/OCPSTRAT-1756

Closed

is incorporated by

OCPSTRAT-1780 OpenShift Dynamic Resource Allocation for AI Workloads

links to

KEP-4381: Dynamic Resource Allocation with Structured Parameters

openshift/cluster-kube-scheduler-operator#561: OCPNODE-3192: enable the scheduler plugin if the feature gate DynamicResourceAllocation is enabled

openshift/cluster-kube-scheduler-operator#563: [release-4.19] OCPNODE-3192: enable the scheduler plugin if the feature gate DynamicResourceAllocation is enabled

openshift/openshift-docs#99730: Enable Dynamic Resource Allocations for openshift


Assignee: Gaurav Singh

Reporter: Gaurav Singh

Need Info From: None

Contributors: None

Architect: Abu Kashem

QA Contact: Neelesh Agrawal

Doc Contact: Matthew Werner

Product Operations Engineering Contact: Kyle Walker

Votes: 1

Watchers: 21

Created: 2024/11/06 4:08 PM

Updated: 2025/11/21 7:37 PM

Resolved: 2025/10/21 8:43 PM

Target end: 2025/08/11
