-
Epic
-
Resolution: Unresolved
-
Critical
-
PCI passthrough for an AMD GPU MI210
-
To Do
-
RHOSSTRAT-78 - PCI passthrough for a AMD GPU devices
-
25% To Do, 0% In Progress, 75% Done
-
-
Description
As a cloud administrator using Red Hat OpenStack Services on OpenShift, I want the ability to perform PCI passthrough of AMD GPU MI210 devices to instances so that I can run AI and machine learning applications requiring direct GPU access within RHOSO.
Implementing this feature will allow instances to utilize AMD GPUs directly, providing the necessary computational power for advanced AI and machine learning tasks.
Acceptance Criteria
Support for AMD GPU PCI passthrough must be implemented, enabling RHOSO to configure PCI passthrough for AMD GPU devices to instances. Comprehensive documentation must be provided, including guides on setting up PCI passthrough for AMD GPUs, prerequisites, configuration steps, and troubleshooting common issues.
Compatibility must be ensured with a range of AMD GPU models commonly used in AI applications, and with the current versions of OpenShift and the podified OpenStack control plane. Security and isolation must be maintained to ensure that GPU passthrough does not compromise instance isolation or lead to data leakage.
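As an illustration of the configuration steps the documentation would need to cover, the sketch below renders a Nova `[pci]` section and the matching flavor property for one GPU model. It is a minimal sketch under stated assumptions, not the RHOSO procedure: the product ID `740f` for the MI210, the alias name `mi210`, and the helper names are assumptions, the `device_type` value depends on how the card is reported on the host (for example `type-PF` when SR-IOV is enabled), and how the rendered snippet is delivered to compute nodes is out of scope here.

```python
import json

AMD_VENDOR_ID = "1002"       # PCI vendor ID for AMD/ATI
MI210_PRODUCT_ID = "740f"    # assumption: confirm on the host with `lspci -nn`


def build_pci_config(alias_name="mi210", device_type="type-PCI"):
    """Render a Nova [pci] section exposing the GPU for passthrough.

    alias_name and device_type are illustrative defaults, not mandated values.
    """
    device_spec = {"vendor_id": AMD_VENDOR_ID, "product_id": MI210_PRODUCT_ID}
    alias = {**device_spec, "device_type": device_type, "name": alias_name}
    lines = [
        "[pci]",
        f"device_spec = {json.dumps(device_spec)}",
        f"alias = {json.dumps(alias)}",
    ]
    return "\n".join(lines)


def flavor_property(alias_name="mi210", count=1):
    """Return the flavor extra spec that requests `count` devices of the alias."""
    return f"pci_passthrough:alias={alias_name}:{count}"


if __name__ == "__main__":
    print(build_pci_config())
    print(flavor_property())
```

The alias name in the flavor property must match the alias defined in the `[pci]` section, which is why both helpers take the same `alias_name` argument.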
Business Value
Enabling PCI passthrough for AMD GPU devices will allow RHOSO to support AI and machine learning workloads that require GPU acceleration. This adds significant value by meeting customer demand in sectors such as research, telco, finance, and healthcare, which rely on GPU-accelerated computing. It enhances RHOSO's competitiveness as a versatile platform capable of supporting advanced computational workloads for RHOAI and MLOps platforms.
Dependencies
This feature depends on hardware requirements, including physical servers equipped with AMD GPUs that support PCI passthrough and BIOS/UEFI settings configured to enable IOMMU (e.g., AMD-Vi). OpenStack Nova must support PCI passthrough configurations, and necessary drivers and kernel modules must be available on compute nodes. Integration with OpenShift must be ensured so that the podified control plane can manage and schedule instances requiring PCI passthrough.
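The IOMMU dependency can be checked on a compute node before any Nova configuration is attempted. The following is a rough sketch that walks the standard Linux sysfs layout (`/sys/kernel/iommu_groups/<group>/devices/<address>`) and lists AMD devices per IOMMU group. The function name and the `sysfs_root` parameter are hypothetical and exist only to make the check testable; an empty result suggests that the IOMMU is disabled or that no AMD device is present.

```python
from pathlib import Path

AMD_VENDOR_ID = "0x1002"  # PCI vendor ID for AMD/ATI as reported in sysfs


def amd_devices_by_iommu_group(sysfs_root="/sys"):
    """Map IOMMU group number -> list of AMD PCI addresses in that group.

    Returns an empty dict when the iommu_groups directory is missing or empty,
    which usually means the IOMMU is not enabled in firmware or on the kernel
    command line.
    """
    groups_dir = Path(sysfs_root) / "kernel" / "iommu_groups"
    result = {}
    if not groups_dir.is_dir():
        return result
    for group in sorted(groups_dir.iterdir(), key=lambda p: p.name):
        devices_dir = group / "devices"
        if not devices_dir.is_dir():
            continue
        for dev in sorted(devices_dir.iterdir(), key=lambda p: p.name):
            vendor_file = dev / "vendor"
            if not vendor_file.is_file():
                continue
            if vendor_file.read_text().strip().lower() == AMD_VENDOR_ID:
                result.setdefault(group.name, []).append(dev.name)
    return result


if __name__ == "__main__":
    for group, devices in amd_devices_by_iommu_group().items():
        print(f"IOMMU group {group}: {', '.join(devices)}")
```

A GPU that shares its IOMMU group with unrelated devices generally cannot be passed through on its own, so the grouping is worth reviewing alongside the presence check.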
Assumptions
It is assumed that users have administrative access to configure BIOS settings and install the necessary hardware drivers. It is also assumed that the compute nodes run a Red Hat Enterprise Linux version that supports AMD GPU drivers and PCI passthrough.
Risks
There is a risk of complex configuration, as setting up PCI passthrough can be intricate and may require in-depth knowledge of hardware and virtualization technologies. Hardware compatibility for some specific AMD GPU models still has to be checked.
Test Plan
Functional testing will verify that instances can detect and utilize the passed-through AMD GPU, and sample AI workloads will be run to test GPU performance within the instance. Compatibility testing will be conducted with various AMD GPU models and driver versions, ensuring compatibility across different compute node configurations. Security testing will ensure isolation between instances using GPU passthrough, validating that no unauthorized access to GPU memory occurs. Performance testing will benchmark GPU performance in passthrough mode versus bare-metal to assess any overhead.
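Two of the checks in this test plan can be sketched in a few lines: detecting the GPU from inside the instance, and expressing passthrough overhead relative to bare metal. The sketch below assumes the guest provides `lspci -nn` output and that each benchmark yields a single higher-is-better score; the function names are hypothetical and no particular benchmark tool is implied.

```python
import re

# Matches the "[vendor:product]" suffix that `lspci -nn` prints for each device.
PCI_ID_RE = re.compile(r"\[([0-9a-f]{4}):([0-9a-f]{4})\]")


def find_amd_gpus(lspci_output, vendor_id="1002"):
    """Return (pci_address, product_id) pairs for devices from the given vendor."""
    matches = []
    for line in lspci_output.splitlines():
        for vendor, product in PCI_ID_RE.findall(line.lower()):
            if vendor == vendor_id:
                matches.append((line.split()[0], product))
    return matches


def passthrough_overhead_pct(bare_metal_score, passthrough_score):
    """Relative performance loss of passthrough versus bare metal, in percent.

    Assumes higher scores are better; a negative result means the passthrough
    run scored higher than the bare-metal run.
    """
    if bare_metal_score <= 0:
        raise ValueError("bare_metal_score must be positive")
    return (bare_metal_score - passthrough_score) / bare_metal_score * 100.0
```

In a functional test, an empty list from `find_amd_gpus` would fail the detection step, and the performance test would compare `passthrough_overhead_pct` against whatever threshold the team agrees on.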
Documentation
User guides must be updated to include a new section specifically for hardware passthrough, detailing prerequisites, configuration steps, and verification procedures. Release notes should highlight the addition of AMD GPU PCI passthrough support. A troubleshooting guide should document common issues and their resolutions related to GPU passthrough, aiding users in resolving potential problems efficiently.
- is depended on by
-
OSPRH-10827 AMD Infinity Fabric Support
- New
-
OSPRH-10916 Document limitations of migration to AMD Infinity Fabric
- New