Type: Bug
Resolution: Unresolved
Version: rhel-9.1.0
SST: rhel-sst-network-drivers
Pool Team: ssg_networking
Doc Text: If docs needed, set a value
Description of problem:
"openmpi ucx osu_bw" test fails during UCX test when tested on ALL variants of MLX5 ROCE HCA.
This is a regression issue when compared with the RHEL-9.1.0-20220524.0 build for CTC#1 test cycle; also for CTC#2 (the build no longer exists)
Version-Release number of selected component (if applicable):
Client: rdma-dev-22
Server: rdma-dev-21
DISTRO=RHEL-9.1.0-20220910.0
+ [22-09-11 20:02:57] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.1 Beta (Plow)
+ [22-09-11 20:02:57] uname -a
Linux rdma-dev-22.rdma.lab.eng.rdu2.redhat.com 5.14.0-162.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Sep 5 10:44:43 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
+ [22-09-11 20:02:57] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-162.el9.x86_64 root=UUID=376371e8-0b44-45c2-8687-191dbb3737bc ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=beb6c243-17c9-4210-ba33-d2c0b4062b8a console=ttyS1,115200n81
+ [22-09-11 20:02:57] rpm -q rdma-core linux-firmware
rdma-core-41.0-3.el9.x86_64
linux-firmware-20220708-127.el9.noarch
+ [22-09-11 20:02:57] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_2/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_1/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.2006
+ [22-09-11 20:02:57] lspci
+ [22-09-11 20:02:57] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Installed:
ucx-cma-1.13.0-1.el9.x86_64 ucx-ib-1.13.0-1.el9.x86_64
ucx-rdmacm-1.13.0-1.el9.x86_64
+ [22-09-11 20:15:38] timeout --preserve-status --kill-after=5m 3m ompi_info --parsable
+ [22-09-11 20:15:38] grep ucx
mca:osc:ucx:version:"mca:2.1.0"
mca:osc:ucx:version:"api:3.0.0"
mca:osc:ucx:version:"component:4.1.1"
mca:pml:ucx:version:"mca:2.1.0"
mca:pml:ucx:version:"api:2.0.0"
mca:pml:ucx:version:"component:4.1.1"
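As an additional sanity check (not part of the captured log), the transports and devices UCX detects on each host can be listed with ucx_info from the ucx package; the exact output format varies between UCX releases:
# list the UCX version and the mlx5 devices/transports it can use (hypothetical check)
ucx_info -v
ucx_info -d | grep -i mlx5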
How reproducible:
100%
Steps to Reproduce:
1. Install RHEL-9.1.0-20220910.0 on any of the following RoCE hosts:
   rdma-dev-19/20, rdma-dev-21/22, rdma-perf-02/03, rdma-virt-02/03
2. Install and execute the kernel-kernel-infiniband-ucx test script (a minimal manual invocation is sketched below)
3. Check the UCX test result on the client side
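For a manual reproduction outside the test harness, something along the lines of the failing invocation under Actual results can be used. The hostfile contents below are an assumption (one rank slot per node); the mpirun options are taken from the failing run, trimmed of the openib-specific parameters:
# hypothetical hostfile with one slot per node
cat > /root/hfile_one_core <<'EOF'
rdma-dev-21.rdma.lab.eng.rdu2.redhat.com slots=1
rdma-dev-22.rdma.lab.eng.rdu2.redhat.com slots=1
EOF
# run osu_bw over UCX on the RoCE port, as in the failing test
mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node \
  -mca btl '^vader,tcp,openib' -mca pml ucx -mca osc ucx \
  -x UCX_NET_DEVICES=mlx5_0:1 mpitests-osu_bw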
Actual results:
+ [22-09-11 20:19:09] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl '^vader,tcp,openib' -mca btl_openib_cpc_include rdmacm -mca btl_openib_receive_queues P,65536,256,192,128 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx5_0:1 mpitests-osu_bw
[rdma-dev-22:262261] *** Process received signal ***
[rdma-dev-22:262261] Signal: Bus error (7)
[rdma-dev-22:262261] Signal code: Non-existant physical address (2)
[rdma-dev-22:262261] Failing at address: 0x7fef511b6000
[rdma-dev-22:262261] [ 0] /lib64/libc.so.6(+0x54d90)[0x7fef5b310d90]
[rdma-dev-22:262261] [ 1] /lib64/libc.so.6(+0xc290a)[0x7fef5b37e90a]
[rdma-dev-22:262261] [ 2] /lib64/libfabric.so.1(+0x7836e4)[0x7fef593ef6e4]
[rdma-dev-22:262261] [ 3] /lib64/libfabric.so.1(+0x787ebf)[0x7fef593f3ebf]
[rdma-dev-22:262261] [ 4] /lib64/libfabric.so.1(+0x770299)[0x7fef593dc299]
[rdma-dev-22:262261] [ 5] /lib64/libfabric.so.1(+0x7707ed)[0x7fef593dc7ed]
[rdma-dev-22:262261] [ 6] /lib64/libfabric.so.1(+0x753e5d)[0x7fef593bfe5d]
[rdma-dev-22:262261] [ 7] /lib64/libfabric.so.1(+0x747bff)[0x7fef593b3bff]
[rdma-dev-22:262261] [ 8] /usr/lib64/openmpi/lib/openmpi/mca_btl_ofi.so(+0x6cdf)[0x7fef59562cdf]
[rdma-dev-22:262261] [ 9] /usr/lib64/openmpi/lib/libopen-pal.so.40(mca_btl_base_select+0x112)[0x7fef5b1bae62]
[rdma-dev-22:262261] [10] /usr/lib64/openmpi/lib/openmpi/mca_bml_r2.so(mca_bml_r2_component_init+0x18)[0x7fef5956c188]
[rdma-dev-22:262261] [11] /usr/lib64/openmpi/lib/libmpi.so.40(mca_bml_base_init+0x94)[0x7fef5b56cc94]
[rdma-dev-22:262261] [12] /usr/lib64/openmpi/lib/libmpi.so.40(ompi_mpi_init+0x664)[0x7fef5b5accb4]
[rdma-dev-22:262261] [13] /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Init+0x72)[0x7fef5b54c482]
[rdma-dev-22:262261] [14] mpitests-osu_bw(+0x25d5)[0x559f152565d5]
[rdma-dev-22:262261] [15] /lib64/libc.so.6(+0x3feb0)[0x7fef5b2fbeb0]
[rdma-dev-22:262261] [16] /lib64/libc.so.6(__libc_start_main+0x80)[0x7fef5b2fbf60]
[rdma-dev-22:262261] [17] mpitests-osu_bw(+0x3fa5)[0x559f15257fa5]
[rdma-dev-22:262261] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node rdma-dev-22 exited on signal 7 (Bus error).
--------------------------------------------------------------------------
+ [22-09-11 20:19:13] RQA_check_result -r 135 -t 'openmpi ucx osu_bw'
Also, a core file was detected afterwards.
Sun 2022-09-11 20:19:10 EDT 262261 0 0 SIGBUS none /usr/lib64/openmpi/bin/mpitests-osu_bw n/a
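Two hedged follow-ups, neither part of the test script: the backtrace above passes through mca_btl_ofi.so and libfabric rather than UCX itself, so excluding the ofi BTL as well may be a useful triage experiment (an assumption, not a verified workaround); and since systemd-coredump recorded the crash, a full backtrace can be pulled from the stored core:
# triage experiment: also exclude the ofi BTL (assumption, not a verified fix)
mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node \
  -mca btl '^vader,tcp,openib,ofi' -mca pml ucx -mca osc ucx \
  -x UCX_NET_DEVICES=mlx5_0:1 mpitests-osu_bw
# inspect the recorded core for PID 262261 and get a full backtrace
coredumpctl info 262261
coredumpctl gdb 262261    # then run: bt full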
Expected results:
Result from the RHEL-9.1.0-20220524.0 build (for comparison):
+ [22-09-11 19:14:39] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl '^vader,tcp,openib' -mca btl_openib_cpc_include rdmacm -mca btl_openib_receive_queues P,65536,256,192,128 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx5_1:1 mpitests-osu_bw
# OSU MPI Bandwidth Test v5.8
# Size      Bandwidth (MB/s)
1 4.98
2 10.21
4 20.73
8 41.60
16 83.18
32 153.85
64 187.22
128 371.50
256 703.36
512 1112.31
1024 2091.28
2048 2793.22
4096 4250.75
8192 9652.65
16384 9312.98
32768 11362.33
65536 11865.93
131072 12003.34
262144 12084.48
524288 12140.72
1048576 12187.05
2097152 12207.17
4194304 12219.12
+ [22-09-11 19:14:45] RQA_check_result -r 0 -t 'openmpi ucx osu_bw'
Also, no core file should be generated.
Additional info: