  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-665

NVIDIA GPUs on Oracle Cloud Infrastructure


    Description

      Feature Overview

      Enable NVIDIA GPUs on Oracle Cloud Infrastructure (OCI) for the OpenShift platform, so that users can run GPU-accelerated AI workloads.

      Goals

      • Users can provision and configure NVIDIA GPU instances on Oracle Cloud Infrastructure within the OpenShift platform.
      • Users can seamlessly integrate GPU-accelerated libraries and frameworks into their AI workloads running on OpenShift.
      • Users can efficiently scale their GPU resources based on workload demands.
      • Users can monitor and manage GPU instances and performance metrics within the OpenShift console.
      • Users can share GPU memory between OCI worker nodes.

      Requirements 

      • Containers should be able to run CUDA workloads and use one or more GPUs
      • Multi-Instance GPU (MIG) should work on A100 shapes
      • RDMA, enabled by the NVIDIA Network Operator, should work for sharing GPU memory
      • RDMA, enabled through dma-buf, should work for sharing GPU memory
      • Enablement validated on two Compute shapes: BM.GPU4 and BM.GPU.A100
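
      As an illustration of the first requirement, the following is a minimal sketch of how a workload might request one or more GPUs. It assumes the GPUs are exposed to the cluster as the `nvidia.com/gpu` extended resource, as the NVIDIA device plugin does. The helper name `gpu_pod_manifest`, the pod name, and the image `registry.example.com/cuda-sample:latest` are hypothetical and not part of this issue.

```python
import json


def gpu_pod_manifest(name, image, gpu_count=1, gpu_resource="nvidia.com/gpu"):
    """Build a Pod manifest (as a dict) that requests GPUs via an extended resource.

    Hypothetical helper for illustration only; the resource name is an
    assumption about how the GPUs are advertised on the worker nodes.
    """
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": name,
                    "image": image,
                    # Extended resources are specified under limits;
                    # Kubernetes defaults the request to the same value.
                    "resources": {"limits": {gpu_resource: gpu_count}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical pod name and image, requesting two full GPUs.
    manifest = gpu_pod_manifest(
        "cuda-sample", "registry.example.com/cuda-sample:latest", gpu_count=2
    )
    print(json.dumps(manifest, indent=2))
```

      The printed JSON could be saved to a file and applied with `oc apply -f`. For the MIG requirement, and assuming the GPU Operator is configured with the mixed MIG strategy, a MIG slice could be requested by passing a resource name such as `nvidia.com/mig-1g.5gb` as `gpu_resource` instead.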

      Use Cases

      • Use Case 1: User provisions a GPU-enabled OCI instance through the OpenShift console and deploys an AI workload that leverages GPU acceleration.
      • Use Case 2: User scales up the GPU resources for an AI workload running on OpenShift to handle increased demand and achieve faster processing.
      • Use Case 3: User shares GPU memory between two containers running on two separate OpenShift worker nodes.
      • Use Case 4: User monitors GPU instance utilization through the OpenShift console.

      Documentation Considerations

      • The NVIDIA GPU operator community documentation should be updated in the "Bare Metal / Virtual Machines with GPU Passthrough" section to confirm OCI support and list which OCI shapes are supported by NVIDIA.
      • The NVIDIA AI Enterprise support matrix should be updated to confirm OCI support and list which OCI shapes are supported by NVIDIA.

            People

              egallen Erwan Gallen