  1. Red Hat OpenStack Services on OpenShift
  2. OSPRH-11010

PCI passthrough for AMD GPU MI210


    • Type: Epic
    • Resolution: Unresolved
    • Priority: Critical
    • PCI passthrough for an AMD GPU MI210
    • To Do
    • RHOSSTRAT-78 - PCI passthrough for a AMD GPU devices
    • 25% To Do, 0% In Progress, 75% Done

      Description

      As a cloud administrator using Red Hat OpenStack Services on OpenShift, I want the ability to perform PCI passthrough of AMD GPU MI210 devices to instances so that I can run AI and machine learning applications requiring direct GPU access within RHOSO.

      Implementing this feature will allow instances to utilize AMD GPUs directly, providing the necessary computational power for advanced AI and machine learning tasks.

      Acceptance Criteria

      Support for AMD GPU PCI passthrough must be implemented, enabling RHOSO to configure PCI passthrough for AMD GPU devices to instances. Comprehensive documentation must be provided, including guides on setting up PCI passthrough for AMD GPUs, prerequisites, configuration steps, and troubleshooting common issues.
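
      As a rough illustration of the configuration this criterion refers to, the sketch below builds a Nova `[pci]` fragment and the matching flavor extra spec. It is a minimal sketch under stated assumptions, not the RHOSO procedure: the vendor ID 1002 (AMD/ATI) and product ID 740f (assumed for the MI210) should be confirmed with `lspci -nn` on the compute node, the alias name `mi210` is hypothetical, and the `device_type` may need to be `type-PF` if the card exposes SR-IOV capability. How the fragment is delivered to the compute nodes in RHOSO is outside the scope of this sketch.

```python
import configparser
import io
import json

# Assumed PCI IDs: 1002 is the AMD/ATI vendor ID; 740f is assumed for the
# MI210 and should be confirmed with `lspci -nn` on the compute node.
VENDOR_ID = "1002"
PRODUCT_ID = "740f"
ALIAS_NAME = "mi210"  # hypothetical alias name chosen for this example


def build_nova_pci_config(vendor_id, product_id, alias_name, device_type="type-PCI"):
    """Return a nova.conf fragment with [pci] device_spec and alias entries."""
    # interpolation=None keeps the JSON values exactly as written.
    config = configparser.ConfigParser(interpolation=None)
    config["pci"] = {
        # device_spec tells nova-compute which host devices may be passed through.
        "device_spec": json.dumps(
            {"vendor_id": vendor_id, "product_id": product_id}
        ),
        # alias gives the device a name that flavors can request.
        "alias": json.dumps(
            {
                "vendor_id": vendor_id,
                "product_id": product_id,
                "device_type": device_type,
                "name": alias_name,
            }
        ),
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


def flavor_extra_spec(alias_name, count=1):
    """Return the flavor extra spec that requests `count` devices of an alias."""
    return {"pci_passthrough:alias": f"{alias_name}:{count}"}
```

      Calling `build_nova_pci_config(VENDOR_ID, PRODUCT_ID, ALIAS_NAME)` yields the text of the `[pci]` section, and `flavor_extra_spec(ALIAS_NAME)` yields the property to set on a flavor so that instances created from it receive one GPU.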

      Compatibility must be ensured with a range of AMD GPU models commonly used in AI applications, and with the current versions of OpenShift and the podified OpenStack control plane. Security and isolation must be maintained to ensure that GPU passthrough does not compromise instance isolation or lead to data leakage.

      Business Value

      Enabling PCI passthrough for AMD GPU devices will allow RHOSO to support AI and machine learning workloads that require GPU acceleration. This adds significant value by meeting customer demand in sectors such as research, telco, finance, and healthcare, which rely on GPU-accelerated computing. It enhances RHOSO's competitiveness as a versatile platform capable of supporting advanced computational workloads for RHOAI and MLOps platforms.

      Dependencies

      This feature depends on hardware requirements, including physical servers equipped with AMD GPUs that support PCI passthrough and BIOS/UEFI settings configured to enable IOMMU (e.g., AMD-Vi). OpenStack Nova must support PCI passthrough configurations, and necessary drivers and kernel modules must be available on compute nodes. Integration with OpenShift must be ensured so that the podified control plane can manage and schedule instances requiring PCI passthrough.
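
      The host-side dependencies above could be checked with something like the following sketch, which looks for IOMMU groups in sysfs and lists AMD GPU-class PCI devices together with the driver currently bound to them. It assumes a Linux compute node with the standard sysfs layout; the function names are hypothetical, and the class filter (display controllers, 0x03, and processing accelerators, 0x12) is an assumption about how the cards are reported.

```python
from pathlib import Path

AMD_VENDOR_ID = "0x1002"  # AMD/ATI PCI vendor ID as reported by sysfs
# Assumed PCI base classes for GPUs: display controller, processing accelerator.
GPU_CLASS_PREFIXES = ("0x03", "0x12")


def iommu_groups_present(sysfs_root="/sys"):
    """Return True when the kernel has created at least one IOMMU group."""
    groups = Path(sysfs_root) / "kernel" / "iommu_groups"
    return groups.is_dir() and any(groups.iterdir())


def find_amd_gpus(sysfs_root="/sys"):
    """Return (pci_address, device_id, driver) tuples for AMD GPU-class devices."""
    devices_dir = Path(sysfs_root) / "bus" / "pci" / "devices"
    found = []
    if not devices_dir.is_dir():
        return found
    for dev in sorted(devices_dir.iterdir()):
        vendor_file = dev / "vendor"
        class_file = dev / "class"
        if not (vendor_file.is_file() and class_file.is_file()):
            continue
        if vendor_file.read_text().strip().lower() != AMD_VENDOR_ID:
            continue
        if not class_file.read_text().strip().lower().startswith(GPU_CLASS_PREFIXES):
            continue
        device_id = (dev / "device").read_text().strip()
        # The "driver" entry is a symlink to the bound driver, if any.
        driver_link = dev / "driver"
        driver = driver_link.resolve().name if driver_link.exists() else None
        found.append((dev.name, device_id, driver))
    return found
```

      For passthrough, a device would typically be expected to show `vfio-pci` as its driver rather than a host GPU driver; an empty result from `iommu_groups_present()` suggests the IOMMU is not enabled in firmware or in the kernel.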

      Assumptions

      It is assumed that users have administrative access to configure BIOS settings and install the necessary hardware drivers, and that the compute nodes run a Red Hat Enterprise Linux version that supports AMD GPU drivers and PCI passthrough.

      Risks

      There is a risk of complex configuration, as setting up PCI passthrough can be intricate and may require in-depth knowledge of hardware and virtualization technologies. Hardware compatibility for some specific AMD GPU models still has to be checked.

      Test Plan

      Functional testing will verify that instances can detect and utilize the passed-through AMD GPU, and sample AI workloads will be run to test GPU performance within the instance. Compatibility testing will be conducted with various AMD GPU models and driver versions, ensuring compatibility across different compute node configurations. Security testing will ensure isolation between instances using GPU passthrough, validating that no unauthorized access to GPU memory occurs. Performance testing will benchmark GPU performance in passthrough mode versus bare-metal to assess any overhead.
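
      For the performance comparison, the passthrough overhead could be expressed as the relative loss against the bare-metal baseline. The sketch below is one possible way to compute it, assuming higher-is-better benchmark scores (for example, samples per second) collected from repeated runs; the function name is hypothetical and the issue does not prescribe a metric.

```python
from statistics import mean


def passthrough_overhead_percent(bare_metal_scores, passthrough_scores):
    """Return the relative throughput loss of passthrough vs bare metal, in percent.

    Both arguments are sequences of higher-is-better benchmark scores from
    repeated runs. A negative result means passthrough scored higher on average.
    """
    baseline = mean(bare_metal_scores)
    if baseline <= 0:
        raise ValueError("bare-metal baseline must be positive")
    return (baseline - mean(passthrough_scores)) / baseline * 100.0
```

      With hypothetical scores of `[100, 100]` on bare metal and `[95, 97]` in the instance, the function reports an overhead of 4 percent.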

      Documentation

      User guides must be updated to include a new section specifically for hardware passthrough, detailing prerequisites, configuration steps, and verification procedures. Release notes should highlight the addition of AMD GPU PCI passthrough support. A troubleshooting guide should document common issues and their resolutions related to GPU passthrough, aiding users in resolving potential problems efficiently.

              geguileo@redhat.com Gorka Eguileor
              egallen Erwan Gallen
              rhos-dfg-ai-enablement