-
Epic
-
Resolution: Done
-
Normal
-
Compare MOFED and DMA-BUF transport performance
Epic Goal
- Determine whether DMA-BUF is a viable option for customers by comparing its performance with MOFED
Scenarios
- Confirm that DMA-BUF works with the NVIDIA GPU driver by running a basic ibwrite test on RHEL 9.2 with Kamal's kernel: https://people.redhat.com/kheib/.dmabuf_v6.0/
- Test a multi-node workload on RHEL 9.2 with MOFED and nv-peermem, which should already work. This gives us the baseline benchmark, without the overhead of OpenShift.
- Test a multi-node workload on RHEL 9.2 with in-tree mlx5 and DMA-BUF. This lets us compare the performance against the baseline and catch any problems before the end of RHEL 9.2 development.
- Confirm that the NVIDIA GPU Operator can leverage DMA-BUF on OpenShift 4.13. We could report issues for v23.3.0 of the operator.
- Compare MOFED+nv-peermem and mlx5+DMA-BUF performance on OpenShift 4.13 to verify whether OpenShift impacts the results.
- Rinse and repeat with the ARM servers (arm 5/6/7/8) for completeness
Hardware - perf25/perf27 cluster and the arm5-8 cluster (ARM) in the perf lab
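For the comparison scenarios above, a minimal sketch of how the per-configuration results could be summarized for the performance report. The function names, the configuration labels, and the sample numbers are all hypothetical placeholders, not measurements; it assumes each run yields a list of bandwidth samples in the same unit.

```python
from statistics import mean


def relative_delta(baseline, candidate):
    """Percent change of the candidate mean relative to the baseline mean.

    Negative means the candidate is slower than the baseline.
    """
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; cannot compute a relative delta")
    return (mean(candidate) - base) / base * 100.0


def summarize(results, baseline_key):
    """Map each non-baseline configuration to its percent delta vs the baseline."""
    baseline = results[baseline_key]
    return {
        name: relative_delta(baseline, samples)
        for name, samples in results.items()
        if name != baseline_key
    }


if __name__ == "__main__":
    # Placeholder samples only (hypothetical, NOT real results).
    results = {
        "rhel-mofed-nvpeermem": [100.0, 100.0],
        "rhel-mlx5-dmabuf": [95.0, 97.0],
    }
    for name, delta in summarize(results, "rhel-mofed-nvpeermem").items():
        print(f"{name}: {delta:+.1f}% vs baseline")
```

Using the bare-metal MOFED+nv-peermem run as the baseline keeps the OpenShift and DMA-BUF effects separable: each other configuration is reported as a delta against the same reference.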
Acceptance Criteria
- Performance report
- Blog (optional)
Dependencies (internal and external)
Previous Work (Optional):
- …
Open questions:
- …