Bug
Resolution: Won't Do
rhel-9.2.0
rhel-net-drivers
ssg_networking
Description of problem:
Some OPENMPI benchmarks time out with RC1 when run on CXGB4 devices.
The failed benchmarks are as follows:
FAIL | 1 | openmpi IMB-IO P_Write_indv mpirun one_core
FAIL | 1 | openmpi IMB-IO P_Write_expl mpirun one_core
FAIL | 1 | openmpi IMB-IO P_Write_shared mpirun one_core
FAIL | 1 | openmpi IMB-IO P_Write_priv mpirun one_core
FAIL | 1 | openmpi IMB-IO C_Write_indv mpirun one_core
FAIL | 1 | openmpi IMB-IO C_Write_expl mpirun one_core
FAIL | 1 | openmpi IMB-IO C_Write_shared mpirun one_core
FAIL | 1 | openmpi OSU get_acc_latency mpirun one_core
FAIL | 1 | openmpi OSU mbw_mr mpirun one_core
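For triage, the failure list above can be tallied per benchmark suite. The following is only a minimal sketch: it assumes each result line has the form `STATUS | value | mpi suite benchmark launcher mode`, the function name `parse_result` and the dictionary keys are hypothetical, and the meaning of the middle column is not stated in this report, so it is kept as a raw string.

```python
from collections import Counter

# Result lines copied verbatim from the failure list in this report.
RESULT_LINES = """\
FAIL | 1 | openmpi IMB-IO P_Write_indv mpirun one_core
FAIL | 1 | openmpi IMB-IO P_Write_expl mpirun one_core
FAIL | 1 | openmpi IMB-IO P_Write_shared mpirun one_core
FAIL | 1 | openmpi IMB-IO P_Write_priv mpirun one_core
FAIL | 1 | openmpi IMB-IO C_Write_indv mpirun one_core
FAIL | 1 | openmpi IMB-IO C_Write_expl mpirun one_core
FAIL | 1 | openmpi IMB-IO C_Write_shared mpirun one_core
FAIL | 1 | openmpi OSU get_acc_latency mpirun one_core
FAIL | 1 | openmpi OSU mbw_mr mpirun one_core
""".splitlines()


def parse_result(line):
    """Split one 'STATUS | value | test description' line into a dict.

    Assumption: the test description always has five space-separated
    tokens (mpi, suite, benchmark, launcher, mode), as in the lines above.
    """
    status, raw_value, test = [field.strip() for field in line.split("|")]
    mpi, suite, benchmark, launcher, mode = test.split()
    return {
        "status": status,
        "raw_value": raw_value,  # meaning not stated in the report
        "mpi": mpi,
        "suite": suite,
        "benchmark": benchmark,
        "launcher": launcher,
        "mode": mode,
    }


records = [parse_result(line) for line in RESULT_LINES if line.strip()]
failures_by_suite = Counter(r["suite"] for r in records if r["status"] == "FAIL")

for suite, count in sorted(failures_by_suite.items()):
    print(f"{suite}: {count} failed")
```

Grouping this way makes it easier to see that the IMB-IO failures are all write benchmarks, while the OSU failures are two separate tests.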
This issue seems to occur consistently on the following host pairs:
a. rdma-qe-12 (cxgb4 T5 iw 40) / rdma-perf-06 (cxgb4 T6 iw 100)
beaker job : https://beaker.engineering.redhat.com/jobs/7293293
b. rdma-dev-13 (cxgb4 T6 iw 100) / rdma-perf-06 (cxgb4 T6 iw 100)
beaker job : https://beaker.engineering.redhat.com/jobs/7292260
Version-Release number of selected component (if applicable):
Clients: rdma-perf-06
Servers: rdma-qe-12
DISTRO=RHEL-9.2.0-20221129.2
+ [22-11-30 18:38:52] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 Beta (Plow)
+ [22-11-30 18:38:52] uname -a
Linux rdma-perf-06.rdma.lab.eng.rdu2.redhat.com 5.14.0-202.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 28 08:49:47 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
+ [22-11-30 18:38:52] cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-202.el9.x86_64 root=UUID=60790874-ea0a-4a35-8447-d83f2475913b ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=08d83c36-2fab-45c6-a375-8bb16849b90a console=ttyS0,115200n81
+ [22-11-30 18:38:52] rpm -q rdma-core linux-firmware
rdma-core-41.0-3.el9.x86_64
linux-firmware-20221012-128.el9.noarch
+ [22-11-30 18:38:52] tail /sys/class/infiniband/cxgb4_0/fw_ver /sys/class/infiniband/hfi1_0/fw_ver /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/qedr0/fw_ver /sys/class/infiniband/qedr1/fw_ver
==> /sys/class/infiniband/cxgb4_0/fw_ver <==
1.27.0.0
==> /sys/class/infiniband/hfi1_0/fw_ver <==
1.27.0
==> /sys/class/infiniband/mlx5_0/fw_ver <==
20.99.5392
==> /sys/class/infiniband/mlx5_1/fw_ver <==
20.99.5392
==> /sys/class/infiniband/qedr0/fw_ver <==
8.59.1.0
==> /sys/class/infiniband/qedr1/fw_ver <==
8.59.1.0
+ [22-11-30 18:38:52] lspci
+ [22-11-30 18:38:52] grep -i -e ethernet -e infiniband -e omni -e ConnectX
19:00.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
19:00.1 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
19:00.2 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
19:00.3 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
5e:00.0 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.1 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.2 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.3 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
5e:00.4 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
af:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
af:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
d8:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)
How reproducible:
100% in the above combinations of RDMA hosts
Steps to Reproduce:
1. Refer to the beaker job output on the client hosts mentioned above.
Actual results:
The benchmarks listed above time out and are reported as FAIL.
Expected results:
All OPENMPI benchmarks pass.
Additional info:
However, with the following CXGB4 host combinations, ALL OPENMPI benchmarks PASSED:
a. mpi suite over rdma-iw-cxgb pool[ RHEL-9.2.0-20221129.2: rdma-perf-06/07 - mpich2,openmpi ]
beaker job : https://beaker.engineering.redhat.com/jobs/7291986
b. mpi suite over rdma-iw-cxgb pool[ RHEL-9.2.0-20221129.2: rdma-dev-13/rdma-qe-12 - mpich2,openmpi ] - J:7293324
mpi/openmpi test results on rdma-dev-13/rdma-qe-12 & Beaker job J:7293324:
5.14.0-202.el9.x86_64, rdma-core-41.0-3.el9, cxgb4, iw, T520-CR & cxgb4_0
Result | Status | Test
-------------------------------------------------
Checking for failures and known issues:
no test failures
beaker job : https://beaker.engineering.redhat.com/jobs/7293324