Feature Request
Resolution: Done
Product / Portfolio Work
My suggestion:
Integrate support for GPU kernel modules (e.g., NVIDIA's open-source GPU kernel modules) into OpenShift, leveraging the Node Tuning Operator (NTO) and TuneD to dynamically manage and optimize kernel configurations for GPU-accelerated workloads. This would improve performance, flexibility, and scalability for AI/ML, HPC, and other GPU-intensive applications on OpenShift clusters, positioning OpenShift as a leader in AI on Kubernetes.
Competitive Advantage: By introducing GPU kernel management as a technical preview, OpenShift could be the first Kubernetes platform to offer this capability natively. This leadership in the AI category would differentiate OpenShift from competitors (e.g., vanilla Kubernetes, GKE, EKS), attracting AI/ML practitioners and enterprises seeking optimized GPU performance for training and inference workloads.
Use Case: Data scientists using OpenShift for distributed AI training (e.g., PyTorch or InstructLab) could leverage kernel-level tuning for RDMA or real-time processing, reducing communication overhead and improving scalability. TuneD’s profile-based approach would automate this, aligning with OpenShift’s operator-driven philosophy.
Proposal:
Support GPU Kernel Modules:
- Integrate NVIDIA’s open-source GPU kernel modules into Red Hat CoreOS (RHCOS) as optional packages, similar to kernel-rt for real-time kernels.
- Allow selection of GPU-optimized kernels via MachineConfig (e.g., kernelType: GPU).
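To make the selection mechanism concrete, below is a hypothetical MachineConfig sketch. It is modeled on the existing kernelType: realtime mechanism that selects kernel-rt today; the GPU value does not exist yet and is exactly what this proposal would introduce.

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-gpu-kernel
spec:
  # Proposed value: today "realtime" (kernel-rt) is the analogous
  # existing option; "GPU" is the hypothetical value this RFE adds.
  kernelType: GPU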
Enhance TuneD for GPU Workloads:
- Add TuneD profiles tailored for GPU kernels (e.g., optimizing IRQ affinity, NUMA settings, or RDMA parameters for RoCE).
- Enable dynamic application of these profiles.
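As a sketch of what such a profile could look like, the example below uses the Tuned custom resource that NTO already reconciles, so no new API would be needed for this part. The profile name, sysctl values, isolated cores, and the node-role.kubernetes.io/gpu label are illustrative assumptions, not validated tuning recommendations.

apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: openshift-gpu-rdma
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: openshift-gpu-rdma
    data: |
      [main]
      summary=Illustrative tuning for GPU/RDMA-heavy worker nodes
      include=openshift-node

      [sysctl]
      # Placeholder socket-buffer sizes for RoCE/RDMA traffic
      net.core.rmem_max=4194304
      net.core.wmem_max=4194304

      [scheduler]
      # Keep IRQ handling off the cores feeding the GPUs (illustrative)
      isolated_cores=2-5
  recommend:
  - match:
    # Hypothetical label an admin (or the GPU Operator) would apply
    - label: node-role.kubernetes.io/gpu
    priority: 20
    profile: openshift-gpu-rdma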
Technical Preview: Releasing this feature as a technical preview would give OpenShift a great head start in this space.