Feature Request
Resolution: Done
Product / Portfolio Work
My suggestion:
Integrate support for GPU kernel modules (e.g., NVIDIA's open-source GPU kernel modules) into OpenShift, leveraging the Node Tuning Operator (NTO) and TuneD to dynamically manage and optimize kernel configurations for GPU-accelerated workloads. This would improve performance, flexibility, and scalability for AI/ML, HPC, and other GPU-intensive applications on OpenShift clusters, positioning OpenShift as a leader in AI on Kubernetes.
Competitive Advantage: By introducing GPU kernel management as a technical preview, OpenShift could be the first Kubernetes platform to offer this capability natively. This leadership in the AI category would differentiate OpenShift from competitors (e.g., vanilla Kubernetes, GKE, EKS), attracting AI/ML practitioners and enterprises seeking optimized GPU performance for training and inference workloads.
Use Case: Data scientists using OpenShift for distributed AI training (e.g., PyTorch or InstructLab) could leverage kernel-level tuning for RDMA or real-time processing, reducing communication overhead and improving scalability. TuneD’s profile-based approach would automate this, aligning with OpenShift’s operator-driven philosophy.
Proposal:
Support GPU Kernel Modules:
- Integrate NVIDIA’s open-source GPU kernel modules into Red Hat CoreOS (RHCOS) as optional packages, similar to kernel-rt for real-time kernels.
- Allow selection of GPU-optimized kernels via MachineConfig (e.g., kernelType: GPU).
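To make the selection mechanism concrete, below is a hypothetical MachineConfig sketch. It is modeled on the existing kernelType: realtime mechanism that selects kernel-rt today; the GPU value does not exist yet and is exactly what this proposal would introduce.

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-gpu-kernel
spec:
  # Proposed value: today "realtime" (kernel-rt) is the analogous
  # existing option; "GPU" is the hypothetical value this RFE adds.
  kernelType: GPU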
Enhance TuneD for GPU Workloads:
- Add TuneD profiles tailored for GPU kernels (e.g., optimizing IRQ affinity, NUMA settings, or RDMA parameters for RoCE).
- Enable dynamic application of these profiles.
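As a sketch of what such a profile could look like, the example below uses the Tuned custom resource that NTO already reconciles, so no new API would be needed for this part. The profile name, sysctl values, isolated cores, and the node-role.kubernetes.io/gpu label are illustrative assumptions, not validated tuning recommendations.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-gpu-rdma
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: openshift-gpu-rdma
    data: |
      [main]
      summary=Illustrative tuning for GPU/RDMA-heavy worker nodes
      include=openshift-node

      [sysctl]
      # Placeholder socket-buffer sizes for RoCE/RDMA traffic
      net.core.rmem_max=4194304
      net.core.wmem_max=4194304

      [scheduler]
      # Keep IRQ handling off the cores feeding the GPUs (illustrative)
      isolated_cores=2-5
  recommend:
  - match:
    # Hypothetical label an admin (or the GPU Operator) would apply
    - label: node-role.kubernetes.io/gpu
    priority: 20
    profile: openshift-gpu-rdma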
Technical Preview: Releasing this feature as a technical preview would give OpenShift a great head start in this space.