Type: Bug
Resolution: Won't Do
Version: rhel-9.1.0
Component: rhel-net-drivers
Team: ssg_networking
Description of problem:
All mvapich2 benchmarks fail with timeouts when run with the "mpirun_rsh" command on the HFI OPA device, while all benchmarks run successfully when run with the "mpirun" command. This is a REGRESSION from the RHEL-9.1.0-20220524.0 build, where no such failures were observed.
Version-Release number of selected component (if applicable):
Clients: rdma-qe-15
Servers: rdma-qe-14
DISTRO=RHEL-9.1.0-20220718.0
+ [22-07-20 23:40:51] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.1 Beta (Plow)
+ [22-07-20 23:40:51] uname -a
Linux rdma-qe-15.rdma.lab.eng.rdu2.redhat.com 5.14.0-130.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jul 15 08:52:03 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
+ [22-07-20 23:40:51] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-130.el9.x86_64 root=UUID=abd4cfd1-c58e-4c29-836b-d0f6ee695d89 ro intel_idle.max_cstate=0 processor.max_cstate=0 console=tty0 rd_NO_PLYMOUTH intel_iommu=on crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=95f34601-5d88-4b38-bff8-afc96012d612 console=ttyS1,115200
+ [22-07-20 23:40:51] rpm -q rdma-core linux-firmware
rdma-core-37.2-1.el9.x86_64
linux-firmware-20220509-126.el9.noarch
+ [22-07-20 23:40:51] tail /sys/class/infiniband/hfi1_0/fw_ver
1.27.0
+ [22-07-20 23:40:51] lspci
+ [22-07-20 23:40:51] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 10)
Installed:
mpitests-mvapich2-psm2-5.8-1.el9.x86_64 mvapich2-psm2-2.3.6-3.el9.x86_64
How reproducible:
100%
Steps to Reproduce:
1. bring up the RDMA hosts mentioned above with the RHEL 9.1 build
2. set up the RDMA hosts for mvapich2 benchmark tests
3. run one of the mvapich2 benchmarks with the "mpirun_rsh" command, as follows (a working "mpirun" counterpart is sketched below):
timeout --preserve-status --kill-after=5m 3m mpirun_rsh -np 2 -hostfile /root/hfile_one_core PSM2_PKEY=0x8001 mpitests-IMB-MPI1 PingPong -time 1.5
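For comparison, the hydra-based "mpirun" counterpart of the same benchmark run, which the report says completes successfully, would look roughly like the sketch below; the -genv spelling for passing PSM2_PKEY is an assumption based on common MVAPICH2/hydra usage and may differ on a given build:
# hypothetical equivalent run via MVAPICH2's hydra-based "mpirun";
# hydra passes environment variables with -genv rather than the
# inline VAR=value syntax that mpirun_rsh uses (assumption)
timeout --preserve-status --kill-after=5m 3m mpirun -np 2 -hostfile /root/hfile_one_core -genv PSM2_PKEY 0x8001 mpitests-IMB-MPI1 PingPong -time 1.5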
Actual results:
[rdma-qe-14.rdma.lab.eng.rdu2.redhat.com:mpi_rank_0][mv2_psm_err_handler] PSM error handler: Operation timed out : Detected connection timeout: rdma-qe-15
psm_ep_connect failed with error Operation timed out
[rdma-qe-15.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][mv2_psm_err_handler] PSM error handler: Operation timed out : Detected connection timeout: LID=6:7.0
psm_ep_connect failed with error Operation timed out
[rdma-qe-14.rdma.lab.eng.rdu2.redhat.com:mpi_rank_0][psm_connect_alltoall] psm_connect_alltoall failed
[rdma-qe-15.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][psm_connect_alltoall] psm_connect_alltoall failed
[rdma-qe-15.rdma.lab.eng.rdu2.redhat.com:mpirun_rsh][signal_processor] Caught signal 15, killing job
[rdma-qe-15.rdma.lab.eng.rdu2.redhat.com:mpirun_rsh][signal_processor] Caught signal 15, killing job
+ [22-07-21 03:20:02] __MPI_check_result 1 mpitests-mvapich2-psm2 IMB-MPI1 PingPong mpirun_rsh /root/hfile_one_core
Expected results:
Normal execution of the benchmarks, with statistics output
Additional info:
The same issue exists on the DISTRO=RHEL-9.1.0-20220710.3 build as well.