OpenShift Container Platform (OCP) Strategy / OCPSTRAT-1876

SR-IOV support for NVIDIA GPUDirect RDMA and NVIDIA GPUDirect Storage


    • Product / Portfolio Work

      Feature Overview (aka. Goal Summary)  

      This feature extends Red Hat OpenShift’s SR-IOV networking capabilities to enable and optimize NVIDIA GPUDirect RDMA for AI/ML distributed training and inference workloads. With SR-IOV virtual functions attached to pods, GPUDirect RDMA lets the NIC read and write GPU memory directly, so data-intensive applications can move large volumes of data between GPUs on different nodes without staging it in host memory. This lowers latency and improves performance, scalability, and overall resource efficiency for AI workloads.

      The need is to remove this note:

      Goals (aka. expected user outcomes)
      Provide general availability (GA) support, with a new GPU section in this documentation:

      https://docs.openshift.com/container-platform/4.17/networking/hardware_networks/using-dpdk-and-rdma.html#example-vf-use-in-rdma-mode-mellanox_using-dpdk-and-rdma

      Requirements (aka. Acceptance Criteria):

      The SR-IOV Network Operator supports NVIDIA GPUDirect RDMA.
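      A minimal sketch of the existing RDMA-mode virtual function (VF) configuration that the linked documentation describes, which this requirement would build on. It assumes a Mellanox/NVIDIA ConnectX NIC (PCI vendor ID `15b3`). The helper `rdma_vf_policy`, the policy name, the resource name `mlxnics`, the PF name `ens8f0`, and the device ID `1015` are hypothetical placeholders, not part of any Red Hat or NVIDIA tooling. Only the SR-IOV side is shown; any GPU-side configuration that GPUDirect RDMA needs is out of scope here.

```python
import json


def rdma_vf_policy(name: str, resource_name: str, pf_name: str,
                   num_vfs: int, device_id: str, priority: int = 99) -> dict:
    """Build a SriovNetworkNodePolicy manifest exposing Mellanox VFs in RDMA mode.

    Hypothetical helper for illustration: every argument is a placeholder
    chosen by the caller, not an official GPUDirect configuration.
    """
    if num_vfs < 1:
        raise ValueError("num_vfs must be at least 1")
    if not 0 <= priority <= 99:
        raise ValueError("priority must be between 0 and 99")
    return {
        "apiVersion": "sriovnetwork.openshift.io/v1",
        "kind": "SriovNetworkNodePolicy",
        "metadata": {
            "name": name,
            "namespace": "openshift-sriov-network-operator",
        },
        "spec": {
            "resourceName": resource_name,
            # Only nodes labeled as SR-IOV capable receive the VFs.
            "nodeSelector": {
                "feature.node.kubernetes.io/network-sriov.capable": "true",
            },
            # Smaller value means higher priority when policies overlap.
            "priority": priority,
            "numVfs": num_vfs,
            "nicSelector": {
                "vendor": "15b3",  # PCI vendor ID for Mellanox (NVIDIA networking)
                "deviceID": device_id,
                "pfNames": [pf_name],
            },
            # RDMA mode on Mellanox NICs keeps the kernel netdevice driver.
            "deviceType": "netdevice",
            "isRdma": True,
        },
    }


if __name__ == "__main__":
    # Hypothetical values: replace the PF name and device ID with the real NIC's.
    policy = rdma_vf_policy("mlx-rdma-vfs", "mlxnics", "ens8f0", 4, "1015")
    # JSON output can be piped to `oc apply -f -`.
    print(json.dumps(policy, indent=2))
```

      Generating the manifest from a function keeps the RDMA-specific fields (`deviceType`, `isRdma`, `nicSelector.vendor`) fixed while letting per-cluster values vary.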

      Use Cases:

      This feature enables a clear NVIDIA support statement for:

      • NVIDIA GPUDirect RDMA
      • Distributed PyTorch training on Red Hat OpenShift AI, including multi-GPU and multi-node configurations
      • NVIDIA GPUDirect Storage

      Documentation Considerations

      The documentation should be updated with a new section on NVIDIA GPUs.

              mcurry@redhat.com Marc Curry
              egallen Erwan Gallen
              Ashley Hardin