-
Epic
-
Resolution: Unresolved
-
AMD Infinity Fabric Support
-
To Do
-
RHOSSTRAT-240 - High-Bandwidth GPU Interconnects support
Goal
The goal of this Epic is to integrate AMD Infinity Fabric support within Red Hat OpenStack, allowing for efficient communication between AMD GPUs during AI/ML workloads. By incorporating RCCL (Radeon Collective Communication Library), users will benefit from faster data transfer and lower latency between GPUs, improving the performance and scalability of distributed training tasks. This Epic will help users maximize their AMD GPU infrastructure for high-performance AI workloads.
Acceptance Criteria
- AMD Infinity Fabric is fully supported within Red Hat OpenStack for multi-GPU setups, enabling high-speed inter-GPU communication.
- RCCL integration ensures optimized communication across AMD GPUs, improving performance in distributed AI workloads.
- Performance benchmarks demonstrate improvements in data transfer speed and communication efficiency compared to standard PCIe connections.
- Documentation provides clear instructions for configuring AMD Infinity Fabric and RCCL for AI workloads on Red Hat OpenStack.
- Infinity Fabric operates reliably in different node configurations and shows significant performance gains in popular AI frameworks (e.g., PyTorch).
- Demonstrate vLLM use of RCCL with a RHEL AI VM and multiple AMD GPUs.
- Validate four MI210 GPUs with Infinity Fabric.
- Validate eight MI300X GPUs with Infinity Fabric.
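One possible way to express the benchmark criterion above as a comparable number is the "bus bandwidth" convention used by the nccl-tests/rccl-tests benchmark suites, where an all-reduce over n GPUs is scaled by 2(n-1)/n. The sketch below is only an illustration: the function names are hypothetical, and the message size and timings in the usage note are placeholders, not measurements.

```python
def allreduce_bus_bandwidth_gbps(message_bytes: int, elapsed_s: float, num_gpus: int) -> float:
    """Return all-reduce bus bandwidth in GB/s (1 GB = 1e9 bytes).

    Algorithm bandwidth is message size divided by elapsed time; the
    2 * (n - 1) / n factor normalizes for the number of participating GPUs.
    """
    if num_gpus < 2:
        raise ValueError("an all-reduce needs at least two GPUs")
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    algorithm_bw = message_bytes / elapsed_s / 1e9
    return algorithm_bw * 2 * (num_gpus - 1) / num_gpus


def speedup(fabric_gbps: float, pcie_gbps: float) -> float:
    """Ratio of Infinity Fabric bus bandwidth to the PCIe-only baseline."""
    if pcie_gbps <= 0:
        raise ValueError("baseline bandwidth must be positive")
    return fabric_gbps / pcie_gbps
```

For example, with a 1 GB message on four GPUs, a placeholder timing of 0.5 s for the PCIe-only run and 0.125 s for the Infinity Fabric run would give `speedup(allreduce_bus_bandwidth_gbps(10**9, 0.125, 4), allreduce_bus_bandwidth_gbps(10**9, 0.5, 4))`, which compares the two runs on the same scale regardless of GPU count.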
Open Questions
- What are the specific hardware requirements for fully utilizing AMD Infinity Fabric? Are there certain AMD GPU models that will have limitations?
- Are there additional RCCL configurations or optimizations needed for specific AI/ML use cases?
- How will Red Hat OpenStack handle potential mixed-GPU environments where both NVIDIA NVLink and AMD Infinity Fabric might be present?
- What diagnostic and monitoring tools can be provided to help users evaluate Infinity Fabric performance in real-time?
This Epic will ensure that Red Hat OpenStack provides top-tier support for AMD Infinity Fabric, catering to users who leverage AMD GPUs for large-scale AI/ML workloads.
- depends on: OSPRH-11010 PCI passthrough for AMD GPU MI210 (In Progress)