Performance and Scale for AI Platforms
PSAP-1029

Compare MOFED and DMA-BUF transport performance


    • Type: Epic
    • Resolution: Done
    • Priority: Normal

      Epic Goal

      • Determine whether DMA-BUF is a viable option for customers by comparing its performance with MOFED

      Scenarios

      1. Confirm DMA-BUF works with the NVIDIA GPU driver, with a basic ib_write_bw test on RHEL 9.2 with Kamal's kernel - https://people.redhat.com/kheib/.dmabuf_v6.0/
      2. Test a multi-node workload on RHEL 9.2 with MOFED and nv-peermem, which should already work. This gives us the initial baseline, without the overhead of OpenShift.
      3. Test a multi-node workload on RHEL 9.2 with in-tree mlx5 and DMA-BUF. This lets us compare performance and catch problems before RHEL 9.2 development ends.
      4. Confirm that the NVIDIA GPU Operator can leverage DMA-BUF on OpenShift 4.13, so we can report issues against v23.3.0 of the operator.
      5. Compare MOFED+nv-peermem and mlx5+DMA-BUF performance on OpenShift 4.13 to verify whether OpenShift affects the results.
      6. Rinse and repeat on the ARM servers (arm5-8) for completeness.
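      The basic test in scenario 1 could look like the following (a sketch, not taken from the ticket: the device name mlx5_0, GPU index 0, and server address are assumptions, and the --use_cuda / --use_cuda_dmabuf flags require a perftest build with CUDA support):

      ```shell
      # Server side (e.g. perf25): stage the buffers in GPU 0's memory.
      # With MOFED and nv-peermem loaded, registration goes through the
      # peer-memory path.
      ib_write_bw -d mlx5_0 --report_gbits --use_cuda=0

      # Client side (e.g. perf27): same flags, plus the server's address.
      ib_write_bw -d mlx5_0 --report_gbits --use_cuda=0 <server-ip>

      # For the DMA-BUF path (in-tree mlx5, no nv-peermem), add
      # --use_cuda_dmabuf so perftest registers the GPU buffer via dmabuf:
      ib_write_bw -d mlx5_0 --report_gbits --use_cuda=0 --use_cuda_dmabuf
      ```

      Comparing the reported bandwidth between the two runs gives the MOFED-vs-DMA-BUF delta this epic is after.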

       

      Hardware

      perf25/perf27 cluster and the arm5-8 cluster (ARM) in the perf lab
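      When switching between the two stacks on these machines, a quick sanity check could confirm which GPUDirect path is actually in use (a sketch; the module names are the usual ones for the two stacks, not taken from this ticket):

      ```shell
      # MOFED + nv-peermem run: the peer-memory module must be loaded.
      lsmod | grep -E 'nvidia_peermem|nv_peer_mem'

      # In-tree mlx5 + DMA-BUF run: the module above should be absent,
      # and the kernel should report dmabuf activity for the GPU buffers.
      dmesg | grep -i dmabuf
      ```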

      Acceptance Criteria

      • Performance report
      • Blog (optional)

      Dependencies (internal and external)

      1.  

      Previous Work (Optional):

      Open questions:

       

              jmencak Jiri Mencak
              akamra8979 Ashish Kamra
              Eran Ifrach, Fabien Dupont
              Votes: 0
              Watchers: 8
