  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-1756

Attribute-Based GPU Allocation in OpenShift with NVIDIA K8s DRA Driver


    • BU Product Work
    • OCPSTRAT-1692 AI Workloads for OpenShift
    • 100% To Do, 0% In Progress, 0% Done

      Feature Overview (aka. Goal Summary)  

      With the NVIDIA Kubernetes DRA driver integrated into OpenShift, GPU devices are advertised with detailed attributes, allowing Pods to request GPUs based on specific device characteristics. This attribute-based resource allocation enables fine-tuned control over which GPUs are assigned to workloads, ensuring that each Pod receives the exact GPU specifications it requires.
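      As a rough illustration of what such an attribute-based request could look like, the sketch below builds a ResourceClaim manifest in Python with a single CEL selector. Several details are assumptions rather than anything stated in this issue: the `resource.k8s.io/v1beta1` API version (newer API versions structure requests differently), the `gpu.nvidia.com` device class and attribute domain, the `productName` attribute key, and the helper name `build_gpu_claim`, which is purely hypothetical.

```python
import json


def build_gpu_claim(name, cel_expression, device_class="gpu.nvidia.com"):
    """Return a ResourceClaim manifest (as a dict) with one CEL-selected GPU request.

    Assumptions: the v1beta1 DRA API shape and the 'gpu.nvidia.com' device
    class name; adjust both to match the cluster and driver actually in use.
    """
    return {
        "apiVersion": "resource.k8s.io/v1beta1",  # assumed API version
        "kind": "ResourceClaim",
        "metadata": {"name": name},
        "spec": {
            "devices": {
                "requests": [
                    {
                        "name": "gpu",
                        "deviceClassName": device_class,
                        # The selector is a CEL expression evaluated per device.
                        "selectors": [{"cel": {"expression": cel_expression}}],
                    }
                ]
            }
        },
    }


# Assumed attribute key: 'productName' under the 'gpu.nvidia.com' domain.
claim = build_gpu_claim(
    "a100-claim",
    "device.attributes['gpu.nvidia.com'].productName.contains('A100')",
)
print(json.dumps(claim, indent=2))
```

      The printed JSON could be applied to a cluster and referenced from a Pod's resource claims; the selector expression is the only part that changes when a workload needs a different GPU characteristic.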

      Key Attributes Exposed by the DRA Driver
      The DRA driver advertises several GPU device attributes that OpenShift can utilize for precise GPU selection, including:

      1. Product Name: Specifies the exact GPU model (e.g., NVIDIA A100, V100, or Tesla T4). Pods can request specific models based on performance requirements or application compatibility, so that workloads run on the best-suited hardware.
      2. GPU Memory Capacity: Allows Pods to request GPUs with a minimum or maximum memory capacity (e.g., 8 GB, 16 GB, 40 GB), which is essential for memory-intensive workloads such as large model training or data processing. Applications can be matched to GPUs that meet their memory needs without overcommitting or underutilizing resources.
      3. Compute Capability: Specifies the GPU's CUDA compute capability (e.g., 7.0 for V100, 7.5 for T4, 8.0 for A100), which determines the CUDA features the device supports. Pods can target a minimum compute capability to ensure compatibility with the application's framework (e.g., TensorFlow, PyTorch) and to use optimized processing features.
      4. Power and Thermal Profiles: Pods can request GPUs based on power usage or thermal characteristics, enabling power-sensitive or temperature-sensitive applications to operate efficiently. This is particularly useful in high-density environments where energy or cooling constraints apply.
      5. Device ID and Vendor ID: Identifies the GPU's hardware, which allows applications that require a specific vendor or device type to make targeted requests.
      6. Driver Version: Pods can request GPUs running specific driver versions, ensuring compatibility with application dependencies and access to the GPU features those versions provide.
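      The matching idea behind these attributes can be sketched as a simple filter over advertised devices. This is only an in-memory illustration, not how the scheduler or the DRA driver is implemented: the `Gpu` record, the `select_gpus` helper, and the three-device inventory are hypothetical, and only a subset of the attributes above is modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical record for one advertised GPU (subset of attributes)."""
    name: str
    product_name: str
    memory_gib: int
    compute_capability: tuple  # (major, minor), e.g. (8, 0)


def select_gpus(inventory, product_substr=None, min_memory_gib=0,
                min_compute_capability=(0, 0)):
    """Return the GPUs that satisfy every constraint that was supplied."""
    return [
        gpu for gpu in inventory
        if (product_substr is None or product_substr in gpu.product_name)
        and gpu.memory_gib >= min_memory_gib
        # Tuples compare element by element, so (7, 5) >= (7, 0) holds.
        and gpu.compute_capability >= min_compute_capability
    ]


# Illustrative inventory; a real cluster would advertise its own devices.
INVENTORY = [
    Gpu("gpu-0", "NVIDIA A100", 40, (8, 0)),
    Gpu("gpu-1", "Tesla V100", 16, (7, 0)),
    Gpu("gpu-2", "Tesla T4", 16, (7, 5)),
]

matches = select_gpus(INVENTORY, min_memory_gib=16, min_compute_capability=(7, 5))
print([gpu.name for gpu in matches])
```

      Combining constraints narrows the candidate set: adding `product_substr="A100"` to the call above would leave only `gpu-0`.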

       

      https://github.com/NVIDIA/k8s-dra-driver 


              Gaurav Singh (gausingh@redhat.com)
              Matthew Werner