Bug
Resolution: Won't Do
rhel-8.7.0
rhel-net-drivers
ssg_networking
Network Drivers 6
Description of problem:
The OSU acc_latency benchmark fails with the following error messages:
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
Version-Release number of selected component (if applicable):
Clients: rdma-dev-02
Servers: rdma-perf-06
DISTRO=RHEL-8.7.0-20220524.0
+ [22-05-26 02:08:38] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.7 Beta (Ootpa)
+ [22-05-26 02:08:38] uname -a
Linux rdma-dev-02.rdma.lab.eng.rdu2.redhat.com 4.18.0-393.el8.x86_64 #1 SMP Wed May 18 12:44:50 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
+ [22-05-26 02:08:38] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-393.el8.x86_64 root=UUID=fd7a6a9d-cd42-4b62-9933-1f5f3d4c927b ro console=tty0 rd_NO_PLYMOUTH intel_iommu=on iommu=on crashkernel=auto resume=UUID=9ea769dc-0bb3-455f-a1b3-d99cd5d33215 console=ttyS1,115200
+ [22-05-26 02:08:38] rpm -q rdma-core linux-firmware
rdma-core-37.2-1.el8.x86_64
linux-firmware-20220210-107.git6342082c.el8.noarch
+ [22-05-26 02:08:38] tail /sys/class/infiniband/qedr0/fw_ver /sys/class/infiniband/qedr1/fw_ver
==> /sys/class/infiniband/qedr0/fw_ver <==
8. 59. 1. 0
==> /sys/class/infiniband/qedr1/fw_ver <==
8. 59. 1. 0
+ [22-05-26 02:08:38] lspci
+ [22-05-26 02:08:38] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
08:00.0 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)
08:00.1 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)
Installed:
mpitests-openmpi-5.8-1.el8.x86_64 openmpi-1:4.1.1-3.el8.x86_64
openmpi-devel-1:4.1.1-3.el8.x86_64
How reproducible:
100%
Steps to Reproduce:
1. Use the above build on hosts with a qedr RoCE device.
2. Set up both the RDMA server and the client for openmpi.
3. On the client side, run the following benchmark command:
timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr0:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_roce.45 --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_acc_latency
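For scripted reruns, the reproduction command can be assembled as an argument list rather than one long shell string. The following is a minimal sketch: the helper name `build_osu_command` and its parameters are hypothetical, the option values are copied from the command in this report, and nothing is executed.

```python
import shlex


def build_osu_command(hostfile, rdma_device, ucx_net_device, benchmark):
    """Return the argv list for the reproduction command (nothing is run)."""
    return [
        # Kill the run if it hangs: SIGTERM after 3m, SIGKILL 5m later.
        "timeout", "--preserve-status", "--kill-after=5m", "3m",
        "mpirun", "-hostfile", hostfile, "-np", "2",
        "--allow-run-as-root", "--map-by", "node",
        "-mca", "btl_openib_warn_nonexistent_if", "0",
        "-mca", "btl_openib_if_include", f"{rdma_device}:1",
        "-mca", "mtl", "^psm2,psm,ofi",
        "-mca", "btl", "^openib",
        "--mca", "mtl_base_verbose", "100",
        "--mca", "btl_openib_verbose", "100",
        "-mca", "pml", "ucx",
        "-mca", "osc", "ucx",
        "-x", f"UCX_NET_DEVICES={ucx_net_device}",
        "--mca", "osc_ucx_verbose", "100",
        "--mca", "pml_ucx_verbose", "100",
        benchmark,
    ]


argv = build_osu_command(
    hostfile="/root/hfile_one_core",
    rdma_device="qedr0",
    ucx_net_device="qede_roce.45",
    benchmark="/usr/lib64/openmpi/bin/mpitests-osu_acc_latency",
)

# shlex.join quotes arguments such as ^openib so the line is safe to paste.
print(shlex.join(argv))
```

Passing `argv` to `subprocess.run` on the client would avoid shell quoting problems with the `^`-prefixed component lists.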
Actual results:
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:289 mca_pml_ucx_init
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:114 Pack local worker address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:351 created ucp context 0x56170ef84000, worker 0x56170efd7e50
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:289 mca_pml_ucx_init
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:114 Pack remote worker address, size 38
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:114 Pack local worker address, size 141
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:351 created ucp context 0x55e45dfd7160, worker 0x55e45e524ca0
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:182 Got proc 0 address, size 141
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:411 connecting to proc. 0
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:182 Got proc 1 address, size 141
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:411 connecting to proc. 1
- OSU MPI_Accumulate latency Test v5.8
- Window creation: MPI_Win_allocate
- Synchronization: MPI_Win_flush
- Size Latency (us)
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:182 Got proc 0 address, size 38
[rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:411 connecting to proc. 0
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:182 Got proc 1 address, size 38
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:411 connecting to proc. 1
1 2570.11
2 2570.11
4 2570.11
8 2570.11
16 2570.18
32 2570.10
+ [22-05-26 02:41:36] __MPI_check_result 1 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_acc_latency mpirun /root/hfile_one_core
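The create_qp failures above carry two distinct numeric codes, 22 and 95. A small sketch for tallying them from a captured log follows. It assumes the numbers are Linux errno values (on Linux, 22 is EINVAL and 95 is EOPNOTSUPP); the report itself does not confirm that interpretation. The function names are hypothetical, and the sample text is a short excerpt of the log lines in this report.

```python
import errno
import re
from collections import Counter

# Matches lines such as:
#   [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
CREATE_QP_FAILURE = re.compile(r"failed on ibv_cmd_create_qp with (\d+)")


def tally_create_qp_errors(log_text):
    """Count how often each numeric code follows 'ibv_cmd_create_qp with'."""
    return Counter(int(m.group(1)) for m in CREATE_QP_FAILURE.finditer(log_text))


def errno_name(code):
    """Map a code to its errno symbol on this platform, or 'unknown'.

    Assumption: the logged number is an errno value.
    """
    return errno.errorcode.get(code, "unknown")


sample = """\
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
"""

for code, count in sorted(tally_create_qp_errors(sample).items()):
    print(f"{code} ({errno_name(code)}): {count}")
```

The PSM3 `err=23` line is deliberately not matched, since that value comes from a different message and may not be an errno.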
Expected results:
Normal execution with proper stats output
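To compare a run against the expected stats output, the size/latency rows can be separated from the interleaved debug lines. This is a minimal sketch, assuming result rows are lines made up of an integer message size followed by a decimal latency in microseconds, as in the output above; the function name `parse_latency_rows` is hypothetical.

```python
import re

# A result row is an integer size followed by a decimal latency, nothing else.
ROW = re.compile(r"^\s*(\d+)\s+(\d+\.\d+)\s*$")


def parse_latency_rows(log_text):
    """Return (size_bytes, latency_us) tuples found in mixed benchmark output."""
    rows = []
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match:
            rows.append((int(match.group(1)), float(match.group(2))))
    return rows


sample = """\
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
- Size Latency (us)
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:411 connecting to proc. 1
1 2570.11
2 2570.11
4 2570.11
8 2570.11
16 2570.18
32 2570.10
"""

rows = parse_latency_rows(sample)
for size, latency in rows:
    print(f"{size:>8} bytes  {latency:.2f} us")
```

In this report the rows stop at a message size of 32 bytes, and every latency is close to 2570 us regardless of size.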
Additional info: