Bug; Resolution: Won't Do; rhel-8.8.0; rhel-net-drivers; ssg_networking; Network Drivers 6
+++ This bug was initially created as a clone of Bug #2148553 +++
Description of problem:
All mvapich2 benchmarks fail: with the "mpirun" command they exit with RC 134, and with the "mpirun_rsh" command they exit with RC 1. This happens on hosts with an MT27700 CX-4 device when the transport is IB0 or IB1.
However, it occurs specifically on the rdma-dev-19 / rdma-dev-20 host pair, running as RDMA server and client, respectively.
This is a REGRESSION from RHEL-8.7.0, where all mvapich2 benchmarks PASSED on IB0 on the same HCA on rdma-dev-19 / rdma-dev-20.
Version-Release number of selected component (if applicable):
Clients: rdma-dev-20
Servers: rdma-dev-19
DISTRO=RHEL-8.8.0-20221120.2
+ [22-11-25 16:18:29] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.8 Beta (Ootpa)
+ [22-11-25 16:18:29] uname -a
Linux rdma-dev-20.rdma.lab.eng.rdu2.redhat.com 4.18.0-438.el8.x86_64 #1 SMP Mon Nov 14 13:08:07 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
+ [22-11-25 16:18:29] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-438.el8.x86_64 root=UUID=4dcc79ce-c280-4af4-9b75-02011855b115 ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=1c9d8b9c-d969-417d-ad02-b9e6279dfac8 console=ttyS1,115200n81
+ [22-11-25 16:18:29] rpm -q rdma-core linux-firmware
rdma-core-41.0-1.el8.x86_64
linux-firmware-20220726-110.git150864a4.el8.noarch
+ [22-11-25 16:18:29] tail /sys/class/infiniband/mlx5_2/fw_ver /sys/class/infiniband/mlx5_3/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_3/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.31.1014
+ [22-11-25 16:18:29] lspci
+ [22-11-25 16:18:29] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
04:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Installed:
mpitests-mvapich2-5.8-1.el8.x86_64 mvapich2-2.3.6-1.el8.x86_64
How reproducible:
100%
Steps to Reproduce:
1. bring up the RDMA hosts mentioned above with a RHEL-8.8 build
2. set up the RDMA hosts for mvapich2 benchmark tests (a minimal setup sketch follows the command output below)
3. run one of the mvapich2 benchmarks with the "mpirun" or "mpirun_rsh" command, as follows:
a) mpirun command
timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 mpitests-IMB-MPI1 PingPong -time 1.5
*** buffer overflow detected ***: terminated
[rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][error_sighandler] Caught error: Aborted (signal 6)
*** buffer overflow detected ***: terminated
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 48458 RUNNING AT 172.31.0.120
= EXIT CODE: 134
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:0@rdma-dev-19.rdma.lab.eng.rdu2.redhat.com] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
[proxy:0:0@rdma-dev-19.rdma.lab.eng.rdu2.redhat.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:0@rdma-dev-19.rdma.lab.eng.rdu2.redhat.com] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions
b) "mpirun_rsh" command
+ [22-11-25 14:26:27] timeout --preserve-status --kill-after=5m 3m mpirun_rsh -np 2 -hostfile /root/hfile_one_core mpitests-IMB-MPI1 PingPong -time 1.5
*** buffer overflow detected ***: terminated
[rdma-dev-19.rdma.lab.eng.rdu2.redhat.com:mpi_rank_0][error_sighandler] Caught error: Aborted (signal 6)
*** buffer overflow detected ***: terminated
[rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][error_sighandler] Caught error: Aborted (signal 6)
[rdma-dev-19.rdma.lab.eng.rdu2.redhat.com:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
[rdma-dev-19.rdma.lab.eng.rdu2.redhat.com:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
[rdma-dev-19.rdma.lab.eng.rdu2.redhat.com:mpispawn_0][child_handler] MPI process (rank: 0, pid: 51624) terminated with signal 6 -> abort job
[rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
[rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
[rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpispawn_1][child_handler] MPI process (rank: 1, pid: 52467) terminated with signal 6 -> abort job
[rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node 172.31.0.119 aborted: Error while reading a PMI socket (4)
+ [22-11-25 14:26:30] __MPI_check_result 1 mpitests-mvapich2 IMB-MPI1 PingPong mpirun_rsh /root/hfile_one_core
*** buffer overflow detected ***: terminated
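For reference, a minimal shell sketch of steps 2 and 3 above, assuming a two-host setup in which /root/hfile_one_core simply lists one hostname per line and the stock mvapich2 environment module shipped with RHEL 8; the actual hostfile contents and lab setup scripts are not shown in this report:
# Install the MPI benchmark packages on both hosts (versions as reported under "Installed:" above).
dnf install -y mvapich2 mpitests-mvapich2
# Assumed hostfile layout: one hostname per line, server first, then client.
cat > /root/hfile_one_core << 'EOF'
rdma-dev-19.rdma.lab.eng.rdu2.redhat.com
rdma-dev-20.rdma.lab.eng.rdu2.redhat.com
EOF
# Password-less ssh between the hosts is assumed for mpirun_rsh.
# Load the mvapich2 environment module, then run one benchmark with each launcher.
module load mpi/mvapich2-x86_64
timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 mpitests-IMB-MPI1 PingPong -time 1.5
timeout --preserve-status --kill-after=5m 3m mpirun_rsh -np 2 -hostfile /root/hfile_one_core mpitests-IMB-MPI1 PingPong -time 1.5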
Actual results:
All mvapich2 benchmarks abort with "buffer overflow detected" / signal 6: exit code 134 with "mpirun" and RC 1 with "mpirun_rsh".
Expected results:
Normal run with benchmark statistics
Additional info:
On other host pairs, such as rdma-dev-21 / rdma-dev-22, with the same MT27700 CX-4 device on IB0, all mvapich2 benchmarks PASSED. Likewise, on the rdma-perf-02 / rdma-perf-03 host pair, with an mlx5 MT27800 CX-5 on ib0, all mvapich2 benchmarks PASSED.
is blocked by: RHEL-6130 [RHEL9.2] all mvapich2 benchmarks fail when run on MLX5 IB0 or IB1 on MT27700 CX-4 (Closed)