Bug
Resolution: Won't Do
rhel-8.8.0
rhel-sst-network-drivers
ssg_networking
Network Drivers 6
Description of problem:
After the "openmpi ucx osu_bw" test, the RDMA server host was left with a SIG 6 (SIGABRT) core file when the test was run on MLX5 RoCE with bonding/teaming. This occurred on the RDMA lab machine pair rdma-dev-19/20, with rdma-dev-19 as the server host using bonding and rdma-dev-20 as the client host using teaming.
On rdma-dev-19 (server):
TIME PID UID GID SIG COREFILE EXE
Mon 2022-11-28 11:39:01 EST 79390 0 0 6 present /usr/lib64/openmpi/bin/mpitests-osu_bw
total 2452
rw-r----. 1 root root 2504822 Nov 28 11:39 core.mpitests-osu_bw.0.02d616224e974648ae9e3d757a08ba58.79390.1669653541000000.lz4
Red Hat Enterprise Linux release 8.8 Beta (Ootpa)
This appears to be a regression: the same test on RHEL-8.7.0 did not produce a core file on the server side.
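The core file name above can be cross-checked against the listed crash time. A minimal sketch, assuming the usual systemd-coredump naming of core.COMM.UID.BOOTID.PID.TIMESTAMP with the timestamp in microseconds since the epoch (an assumption, not confirmed by this report); the function name `core_epoch_seconds` is hypothetical:

```shell
# Extract the crash time (seconds since the epoch) from a systemd-coredump
# style file name. Assumes the last dot-separated field before the
# compression suffix is a timestamp in microseconds.
core_epoch_seconds() {
    name=${1##*/}                # drop any leading directory
    case $name in
        *.lz4|*.xz|*.zst) name=${name%.*} ;;   # drop compression suffix
    esac
    usec=${name##*.}             # last field: timestamp in microseconds
    echo $((usec / 1000000))
}
```

For the file in this report the function prints 1669653541, which corresponds to 2022-11-28 11:39:01 EST, the time shown in the core listing. With GNU date, `date -d @1669653541` shows the same instant in the local time zone.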
Version-Release number of selected component (if applicable):
Clients: rdma-dev-20
Servers: rdma-dev-19
DISTRO=RHEL-8.8.0-20221120.2
+ [22-11-28 11:37:35] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.8 Beta (Ootpa)
+ [22-11-28 11:37:35] uname -a
Linux rdma-dev-19.rdma.lab.eng.rdu2.redhat.com 4.18.0-438.el8.x86_64 #1 SMP Mon Nov 14 13:08:07 EST 2022 x86_64 x86_64 x86_64 GNU/Linux
+ [22-11-28 11:37:35] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-438.el8.x86_64 root=UUID=67eba586-c572-49ad-8973-e9030c9f66e6 ro console=tty0 rd_NO_PLYMOUTH intel_idle.max_cstate=0 intel_iommu=on iommu=on processor.max_cstate=0 crashkernel=auto resume=UUID=a124f939-9473-482f-bc5f-f093bc222674 console=ttyS1,115200
+ [22-11-28 11:37:35] rpm -q rdma-core linux-firmware
rdma-core-41.0-1.el8.x86_64
linux-firmware-20220726-110.git150864a4.el8.noarch
+ [22-11-28 11:37:35] tail /sys/class/infiniband/mlx5_2/fw_ver /sys/class/infiniband/mlx5_3/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_3/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.31.1014
+ [22-11-28 11:37:35] lspci
+ [22-11-28 11:37:35] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
04:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Installed:
ucx-cma-1.13.0-1.el8.x86_64 ucx-ib-1.13.0-1.el8.x86_64
ucx-rdmacm-1.13.0-1.el8.x86_64
How reproducible:
100%
Steps to Reproduce:
1. Install RHEL-8.8.0-20221120.2 on rdma-dev-19/20.
2. Install and execute the kernel-kernel-infiniband-ucx test script.
3. Watch the ucx result on the client side.
Actual results:
On rdma-dev-19 (the server host), the above-mentioned core file is present.
Expected results:
No core files should be produced after the "openmpi ucx osu_bw" test.
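The expected result can be checked mechanically after a test run. A minimal sketch, assuming the core listing has the same column layout as the one pasted in the description (a four-field timestamp, then PID, UID, GID, SIG, COREFILE, EXE); the helper name `count_cores` is hypothetical and is not part of the test script:

```shell
# Count rows in a core listing (read from stdin) that match a given signal
# number and executable path. Assumes SIG is field 8 and EXE is field 10,
# as in the listing shown in this report; the header row never matches.
count_cores() {
    sig=$1
    exe=$2
    awk -v sig="$sig" -v exe="$exe" '
        $8 == sig && $10 == exe { n++ }
        END { print n + 0 }
    '
}
```

On the server host this could be fed from `coredumpctl list --no-pager`, for example `coredumpctl list --no-pager | count_cores 6 /usr/lib64/openmpi/bin/mpitests-osu_bw`. A result of 0 matches the expected outcome; other systemd versions may print additional columns, in which case the field numbers need adjusting.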
Additional info: