- Epic
- Resolution: Done
- dma-buf vs nv_peer_mem
- RHOAI, Training
Epic Goal
Why is this important?
Today, multi-node training with GPUDirect RDMA (direct memory access between GPUs on different machines over RDMA) requires installing MOFED, the out-of-tree version of the mlx5 driver, together with the nv_peer_mem module provided by the GPU driver.
However, recent kernel versions provide a generic mechanism called DMA-BUF that allows using the in-tree mlx5 driver without the nv_peer_mem module. That simplifies the stack and removes a thorn in our side for supportability. The DMA-BUF feature is backported to RHEL 9.2.
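As a quick way to tell which of the two stacks a node appears to be running, the following sketch inspects the loaded-module list. It is an illustrative helper, not part of the epic: the function names are hypothetical, and it assumes the peer-memory module appears in `/proc/modules` as either `nv_peer_mem` or `nvidia_peermem`, depending on how it was packaged. The absence of the module does not by itself prove that DMA-BUF is in use.

```python
# Module names under which the peer-memory driver may be loaded.
# Assumption: the exact name depends on which package provided the module.
PEER_MEM_MODULES = {"nv_peer_mem", "nvidia_peermem"}


def loaded_modules(proc_modules_text: str) -> set:
    """Return the module names listed in /proc/modules-formatted text."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def peer_memory_modules(proc_modules_text: str) -> set:
    """Return the peer-memory modules that are currently loaded, if any."""
    return loaded_modules(proc_modules_text) & PEER_MEM_MODULES
```

On a real node, the input would be the contents of `/proc/modules`, for example `peer_memory_modules(open("/proc/modules").read())`. An empty result is what one would expect on the mlx5+DMA-BUF stack.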
What we want to understand now is the performance difference between MOFED+nv_peer_mem and mlx5+DMA-BUF. We can run link-level tests between two RHEL 9.2 machines using ib_send_bw or ib_write_bw with the CUDA option enabled, but that says nothing about real workload performance.
So, we want to run a multi-node, GPU-accelerated AI/ML benchmark in different setups.
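For reference, the link-level test mentioned above could be driven roughly as follows. This is a minimal sketch that only builds the command lines. The device name `mlx5_0` and the host name `node-a` are placeholders, and the `--use_cuda` and `--use_cuda_dmabuf` flags are assumed to be available, which depends on the perftest build having CUDA support (check `ib_write_bw --help` on the target machines).

```python
import shlex


def ib_write_bw_argv(rdma_device, gpu_index=0, use_dmabuf=False, server_host=None):
    """Build an ib_write_bw argument list for a GPU-memory bandwidth test.

    server_host=None builds the server side; a host name builds the client side.
    """
    argv = ["ib_write_bw", "-d", rdma_device, f"--use_cuda={gpu_index}"]
    if use_dmabuf:
        # Assumption: the installed perftest was built with DMA-BUF support.
        argv.append("--use_cuda_dmabuf")
    if server_host is not None:
        argv.append(server_host)
    return argv


# Placeholder device and host names; adjust to the machines under test.
server = ib_write_bw_argv("mlx5_0", use_dmabuf=True)
client = ib_write_bw_argv("mlx5_0", use_dmabuf=True, server_host="node-a")
print(shlex.join(server))
print(shlex.join(client))
```

Running the same pair with `use_dmabuf=False` on the MOFED+nv_peer_mem setup would give the link-level counterpart for comparison.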
Scenarios
- RHEL with MOFED+nv_peer_mem to have a baseline.
- RHEL with mlx5+DMA-BUF to compare on RHEL.
- RHEL + podman with MOFED+nv_peer_mem.
- RHEL + podman with mlx5+DMA-BUF.
- OpenShift 4.13 with GPU and Network Operator with MOFED+nv_peer_mem.
- OpenShift 4.13 with GPU and Network Operator with mlx5+DMA-BUF.
- RHEL + MicroShift 4.13 with MOFED+nv_peer_mem.
- RHEL + MicroShift 4.13 with mlx5+DMA-BUF.
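The scenarios above form a platform-by-stack matrix, which could be enumerated as in the sketch below. The helper names are hypothetical, and `relative_to_baseline` assumes a single higher-is-better metric per scenario (for example, training throughput). The epic does not specify the benchmark or the metric.

```python
from itertools import product

PLATFORMS = [
    "RHEL",
    "RHEL + podman",
    "OpenShift 4.13 with GPU and Network Operator",
    "RHEL + MicroShift 4.13",
]
STACKS = ["MOFED+nv_peer_mem", "mlx5+DMA-BUF"]

# The first scenario in the list serves as the baseline.
BASELINE = ("RHEL", "MOFED+nv_peer_mem")


def scenarios():
    """Return every (platform, stack) pair, in the order listed above."""
    return list(product(PLATFORMS, STACKS))


def relative_to_baseline(results):
    """Return the percent difference of each result versus the baseline.

    `results` maps (platform, stack) to a measured value and must include
    the baseline scenario. Assumption: higher values are better.
    """
    base = results[BASELINE]
    return {key: (value - base) / base * 100.0 for key, value in results.items()}
```

Keying results by `(platform, stack)` keeps the two comparisons separate: stack against stack on the same platform, and platform overhead for the same stack.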
Acceptance Criteria
- ...
Dependencies (internal and external)
- ...
Previous Work (Optional):
- …
Open questions:
- …
Done Checklist
- CI - CI is running, tests are automated and merged.
- Release Enablement <link to Feature Enablement Presentation>
- DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
- DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
- DEV - Downstream build attached to advisory: <link to errata>
- QE - Test plans in Polarion: <link or reference to Polarion>
- QE - Automated tests merged: <link or reference to automated tests>
- DOC - Downstream documentation merged: <link to meaningful PR>