
    • Type: Epic
    • Resolution: Unresolved
    • Priority: Undefined
    • AMD Infinity Fabric Support
    • False
    • False
    • Not Selected
    • Proposed
    • Proposed
    • To Do
    • RHOSSTRAT-240 - High-Bandwidth GPU Interconnects support
    • Proposed
    • Proposed

      Goal

      The goal of this Epic is to integrate AMD Infinity Fabric support into Red Hat OpenStack so that AMD GPUs can communicate efficiently during AI/ML workloads. With RCCL (ROCm Communication Collectives Library) integrated, users get faster data transfer and lower latency between GPUs, which improves the performance and scalability of distributed training. This Epic will help users get the most out of their AMD GPU infrastructure for high-performance AI workloads.

      Acceptance Criteria

       

      • AMD Infinity Fabric is fully supported within Red Hat OpenStack for multi-GPU setups, enabling high-speed inter-GPU communication.
      • RCCL integration ensures optimized communication across AMD GPUs, improving performance in distributed AI workloads.
      • Performance benchmarks demonstrate improvements in data transfer speed and communication efficiency compared to standard PCIe connections.
      • Documentation provides clear instructions for configuring AMD Infinity Fabric and RCCL for AI workloads on Red Hat OpenStack.
      • Infinity Fabric operates reliably in different node configurations and shows significant performance gains in popular AI frameworks (e.g., PyTorch).
      • Demonstrate vLLM using RCCL in a RHEL AI VM with multiple AMD GPUs.
      • Validate four MI210 GPUs with Infinity Fabric.
      • Validate eight MI300X GPUs with Infinity Fabric.
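
      For the benchmark criterion above, one way to compare Infinity Fabric against a PCIe-only baseline is the bus-bandwidth metric that the nccl-tests/rccl-tests suites report for all-reduce. The sketch below is only an illustration: the function names are hypothetical, and the message size and timings would have to come from real measurements (for example an rccl-tests all-reduce run on the four-GPU and eight-GPU configurations).

```python
from typing import Tuple


def allreduce_bandwidth_gbps(message_bytes: int, seconds: float, num_gpus: int) -> Tuple[float, float]:
    """Return (algorithm bandwidth, bus bandwidth) in GB/s for one all-reduce.

    Algorithm bandwidth is message size divided by elapsed time.
    Bus bandwidth scales it by 2 * (n - 1) / n, the all-reduce correction
    factor used by nccl-tests/rccl-tests, so that results are comparable
    across different GPU counts.
    """
    if message_bytes <= 0 or seconds <= 0:
        raise ValueError("message_bytes and seconds must be positive")
    if num_gpus < 2:
        raise ValueError("an all-reduce needs at least two GPUs")
    algbw = message_bytes / seconds / 1e9
    busbw = algbw * 2 * (num_gpus - 1) / num_gpus
    return algbw, busbw


def speedup(fabric_busbw: float, baseline_busbw: float) -> float:
    """Ratio of bus bandwidth with Infinity Fabric to the PCIe-only baseline."""
    if baseline_busbw <= 0:
        raise ValueError("baseline_busbw must be positive")
    return fabric_busbw / baseline_busbw


if __name__ == "__main__":
    # Placeholder inputs for illustration only, not measured results.
    algbw, busbw = allreduce_bandwidth_gbps(1_000_000_000, 0.5, 4)
    print(f"algbw={algbw:.2f} GB/s busbw={busbw:.2f} GB/s")
```

      Reporting bus bandwidth rather than raw elapsed time keeps the four-GPU MI210 and eight-GPU MI300X results on the same scale.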

      Open Questions

      Any additional details, questions, or decisions that need to be made/addressed.

      • What are the specific hardware requirements for fully utilizing AMD Infinity Fabric? Are there certain AMD GPU models that will have limitations?
      • Are there additional RCCL configurations or optimizations needed for specific AI/ML use cases?
      • How will Red Hat OpenStack handle potential mixed-GPU environments where both NVIDIA NVLink and AMD Infinity Fabric might be present?
      • What diagnostic and monitoring tools can be provided to help users evaluate Infinity Fabric performance in real time?

      This Epic will ensure that Red Hat OpenStack provides top-tier support for AMD Infinity Fabric, catering to users who leverage AMD GPUs for large-scale AI/ML workloads.

              Assignee: Unassigned
              Reporter: Erwan Gallen (egallen)
              rhos-dfg-ai-enablement