  1. OpenShift Node
  2. OCPNODE-3672

DRA: Handle extended resource requests via DRA Driver (upstream work for 4.21)


    • DRA: Handle extended resources
    • Product / Portfolio Work
    • OCPSTRAT-2382 DRA: Handle extended resource requests via DRA Driver (upstream work in 1.36)
    • 27% To Do, 0% In Progress, 73% Done

      Epic Goal

      • The goal of this epic is to ensure that the extended resources KEP is enabled in OpenShift, and to track the upstream work to promote this KEP to Beta in Kubernetes 1.35.

      Why is this important?

      • It allows a seamless transition and compatibility between the older, simpler resource request method (Device Plugins) and the advanced features of Dynamic Resource Allocation (DRA). It prevents a split ecosystem and simplifies the adoption of DRA for application developers.
      • Per the motivation of the upstream KEP, cluster administrators must be able to transition to DRA gradually at their own pace, possibly one node at a time. This means supporting clusters where some nodes use device plugins and other nodes use DRA drivers for the same hardware at the same time.

      Scenarios

      The Challenge:

      Imagine you are running a multi-tenant Kubernetes cluster used by various teams for Machine Learning (ML) workloads. These workloads rely heavily on specialized NVIDIA GPUs for training models.

      Historically, your cluster used the standard Kubernetes Device Plugin mechanism to expose the GPUs as an Extended Resource, specifically:

      • Resource Name: nvidia.com/gpu
      • Application Request: Pods request the resource like any other extended resource (e.g., requests: { nvidia.com/gpu: 1 }).
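
      As an illustration of the request style described above, the following minimal sketch models a Pod manifest as a plain Python dict and collects its extended resource quantities. The Pod name, container name, and image are hypothetical, the helper `extended_resource_requests` is not part of any Kubernetes library, and the name check is a simplification of the real validation rules.

```python
# Hypothetical Pod manifest expressed as a plain dict (names and image are illustrative).
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "ml-training"},
    "spec": {
        "containers": [
            {
                "name": "trainer",
                "image": "registry.example.com/ml/trainer:latest",
                "resources": {
                    "limits": {"cpu": "2", "nvidia.com/gpu": 1},
                },
            }
        ]
    },
}


def extended_resource_requests(pod):
    """Sum extended resource quantities across all containers of a Pod.

    Simplified assumption: an extended resource name has a vendor domain
    prefix ("domain/name") that does not end in kubernetes.io.
    """
    totals = {}
    for container in pod["spec"]["containers"]:
        resources = container.get("resources", {})
        # For a name present in both maps, the limits value wins.
        merged = {**resources.get("requests", {}), **resources.get("limits", {})}
        for name, quantity in merged.items():
            domain, sep, _ = name.partition("/")
            if sep and not domain.endswith("kubernetes.io"):
                totals[name] = totals.get(name, 0) + int(quantity)
    return totals


print(extended_resource_requests(pod))
```

      Only the vendor-prefixed name is reported; native resources such as cpu are skipped because they carry no domain prefix.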

      This setup is simple but has limitations:

      1. It doesn't support sharing fine-grained portions of a device (like allocating specific amounts of GPU memory or compute slices).
      2. It doesn't allow expression-based filtering for specific attributes (e.g., "I need a GPU with at least 16GB of VRAM and a specific driver version").
      3. The complex logic for device-specific allocation and initialization is handled outside the core Kubernetes scheduler.
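
      To make the second limitation concrete, here is a conceptual sketch of attribute-based selection written in plain Python. It is not DRA code: DRA expresses such filters as selector expressions evaluated by Kubernetes, whereas the `Device` class, its fields, and the inventory values here are hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical device record; field names are illustrative only."""
    name: str
    memory_gib: int
    driver_version: str


def select_devices(devices, min_memory_gib, driver_version):
    """Return the devices that satisfy both attribute constraints."""
    return [
        d
        for d in devices
        if d.memory_gib >= min_memory_gib and d.driver_version == driver_version
    ]


# Made-up inventory for the example.
inventory = [
    Device("gpu-0", 8, "550.54"),
    Device("gpu-1", 24, "550.54"),
    Device("gpu-2", 40, "535.129"),
]

matches = select_devices(inventory, min_memory_gib=16, driver_version="550.54")
print([d.name for d in matches])
```

      A count-only request such as nvidia.com/gpu: 1 cannot express either constraint, which is why any of the three devices could be handed out under the device plugin model.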

      The Solution: Adopting DRA

      To enable sophisticated features like GPU memory sharing and attribute-based device selection, you decide to implement a Dynamic Resource Allocation (DRA) driver.

      The problem is that your application developers have thousands of existing Pod manifests that all use the old, simple extended resource request: nvidia.com/gpu: 1.
      This KEP mitigates the problem by continuing to support the same extended resource request while fulfilling it through a DRA driver, so existing manifests keep working and DRA functionality becomes available.
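
      The mapping idea can be sketched as follows, assuming that a DeviceClass advertises the extended resource name it serves through a `spec.extendedResourceName` field as described in the upstream KEP. The apiVersion, the driver name `gpu.example.com`, and the helper `resolve_device_class` are assumptions for illustration; the real matching is performed inside Kubernetes, not by user code.

```python
# Hypothetical DeviceClass objects as plain dicts. The apiVersion and the
# driver name are assumptions; extendedResourceName follows the upstream KEP.
device_classes = [
    {
        "apiVersion": "resource.k8s.io/v1",
        "kind": "DeviceClass",
        "metadata": {"name": "gpu.example.com"},
        "spec": {
            "selectors": [
                {"cel": {"expression": 'device.driver == "gpu.example.com"'}}
            ],
            "extendedResourceName": "nvidia.com/gpu",
        },
    }
]


def resolve_device_class(resource_name, device_classes):
    """Return the name of the first DeviceClass that advertises resource_name.

    Returns None when no class matches, in which case the request would
    have to be satisfied the traditional way (a device plugin).
    """
    for device_class in device_classes:
        if device_class["spec"].get("extendedResourceName") == resource_name:
            return device_class["metadata"]["name"]
    return None


print(resolve_device_class("nvidia.com/gpu", device_classes))
print(resolve_device_class("example.com/fpga", device_classes))
```

      In this sketch the Pod manifest stays unchanged: the request for nvidia.com/gpu resolves to a device class on the cluster side, which is the compatibility property the epic relies on.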

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • ...

      Dependencies (internal and external)

      1. The upstream feature needs to graduate to Beta and then GA.
      2. GPU vendors need to update their respective DRA drivers to support the upstream DRAExtendedResource feature.

      Previous Work (Optional):

      1. DRA docs: https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/ 
      2. Extended Resource KEP Issue: https://github.com/kubernetes/enhancements/issues/5004 
      3. Enhancement Proposal: https://github.com/kubernetes/enhancements/pull/5136
      4. Alpha Work: https://github.com/kubernetes/kubernetes/pull/130653 
      5. Alpha Docs: https://github.com/kubernetes/website/pull/51710 

      Open questions:

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

              svanka@redhat.com Sai Ramesh Vanka
              Ayato Tokubi, Sai Ramesh Vanka
              Aditi Sahay