Bug
Resolution: Won't Do
rhel-8.7.0
rhel-sst-network-drivers
ssg_networking
x86_64
Description of problem:
On RHEL-8.7.0, the 'openmpi ucx osu_bw' case of our ucx test failed on hosts using the bonded Mellanox device mlx5_bond_0 (MT27710 ConnectX-4 Lx, per the lspci output below), as shown in the Actual results section. The failure occurred while running the test over the RoCE fabric.
Version-Release number of selected component (if applicable):
DISTRO=RHEL-8.7.0-20220817.0
Red Hat Enterprise Linux release 8.7 Beta (Ootpa)
4.18.0-418.el8.x86_64
rdma-core-41.0-1.el8.x86_64
linux-firmware-20220726-110.git150864a4.el8.noarch
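(The ucx and openmpi package versions are not listed above even though the failure is in the UCX path; something like the following collects the complete set. mpitests-openmpi is assumed to be the package that provides mpitests-osu_bw.)

# Collect the component versions relevant to this test:
cat /etc/redhat-release
uname -r
rpm -q rdma-core linux-firmware ucx openmpi mpitests-openmpi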
+ [22-08-18 10:00:53] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_1/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.32.1010
+ [22-08-18 10:00:53] lspci
+ [22-08-18 10:00:53] grep -i -e ethernet -e infiniband -e omni -e ConnectX
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
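As a cross-check of the hardware details above, the firmware versions and the RDMA-device-to-netdev mapping can be read back without the harness; a minimal sketch, assuming the iproute rdma tool is available:

# Firmware version per HCA, equivalent to the tail invocation above:
for dev in /sys/class/infiniband/*; do
    printf '%s: fw %s\n' "${dev##*/}" "$(cat "$dev/fw_ver")"
done
# Map each RDMA device/port to its netdev; this shows which bond
# interface mlx5_bond_0 is layered on:
rdma link show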
How reproducible:
Seen it only once so far.
Steps to Reproduce:
1. Install RHEL-8.7.0-20220817.0 on rdma-virt-02/03
2. Install and execute the kernel-kernel-infiniband-ucx test script
3. Watch the ucx result on the client side; a distilled form of the failing invocation is sketched below
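Stripped of the harness, the failing step reduces to roughly the following (a sketch, not the test script itself; the hostfile contents were not captured, so the addresses are placeholders):

# Hypothetical standalone reproducer, run from the client host.
cat > /root/hfile_one_core <<'EOF'
<server-address> slots=1
<client-address> slots=1
EOF
timeout --preserve-status --kill-after=5m 3m \
    mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node \
           -mca btl '^vader,tcp,openib' -mca pml ucx -mca osc ucx \
           -x UCX_NET_DEVICES=mlx5_bond_0:1 mpitests-osu_bw

The btl_openib_* parameters from the original command line are omitted here: with the openib BTL excluded and pml ucx selected, the transfer path is UCX, which is where the failure occurs.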
Actual results:
+ [22-08-18 10:06:14] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl '^vader,tcp,openib' -mca btl_openib_cpc_include rdmacm -mca btl_openib_receive_queues P,65536,256,192,128 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx5_bond_0:1 mpitests-osu_bw
# OSU MPI Bandwidth Test v5.8
# Size      Bandwidth (MB/s)
[rdma-virt-02:219082:0:219082] ib_mlx5_log.c:177 Transport retry count exceeded on mlx5_bond_0:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
[rdma-virt-02:219082:0:219082] ib_mlx5_log.c:177 RC QP 0x1379 wqe[0]: SEND --e [inl len 10] [rqpn 0x1379 dlid=0 sl=0 port=1 src_path_bits=0 dgid=::ffff:172.31.40.203 sgid_index=7 traffic_class=0]
==== backtrace (tid: 219082) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x15108a68cedc]
1 /lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x15108a689d41]
2 /lib64/libucs.so.0(ucs_log_default_handler+0xde4) [0x15108a68e6a4]
3 /lib64/libucs.so.0(ucs_log_dispatch+0xe4) [0x15108a68e9c4]
4 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x27a) [0x15108a40259a]
5 /lib64/ucx/libuct_ib.so.0(+0x3c480) [0x15108a419480]
6 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x4d) [0x15108a40403d]
7 /lib64/ucx/libuct_ib.so.0(+0x3a48a) [0x15108a41748a]
8 /lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x15108ad53ada]
9 /usr/lib64/openmpi/lib/libopen-pal.so.40(opal_progress+0x34) [0x1510a07f2f94]
10 /usr/lib64/openmpi/lib/libmpi.so.40(ompi_request_default_wait+0x12d) [0x1510a1e9659d]
11 /usr/lib64/openmpi/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0x103) [0x1510a1f02643]
12 /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Barrier+0xb0) [0x1510a1eadb70]
13 mpitests-osu_bw(+0x1fd0) [0x55d7f1079fd0]
14 /lib64/libc.so.6(__libc_start_main+0xe5) [0x1510a0f6ad85]
15 mpitests-osu_bw(+0x25de) [0x55d7f107a5de]
=================================
[rdma-virt-02:219082] *** Process received signal ***
[rdma-virt-02:219082] Signal: Aborted (6)
[rdma-virt-02:219082] Signal code: (-6)
[rdma-virt-02:219082] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x1510a1308cf0]
[rdma-virt-02:219082] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x1510a0f7eaff]
[rdma-virt-02:219082] [ 2] /lib64/libc.so.6(abort+0x127)[0x1510a0f51ea5]
[rdma-virt-02:219082] [ 3] /lib64/libucs.so.0(+0x27d46)[0x15108a689d46]
[rdma-virt-02:219082] [ 4] /lib64/libucs.so.0(ucs_log_default_handler+0xde4)[0x15108a68e6a4]
[rdma-virt-02:219082] [ 5] /lib64/libucs.so.0(ucs_log_dispatch+0xe4)[0x15108a68e9c4]
[rdma-virt-02:219082] [ 6] /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x27a)[0x15108a40259a]
[rdma-virt-02:219082] [ 7] /lib64/ucx/libuct_ib.so.0(+0x3c480)[0x15108a419480]
[rdma-virt-02:219082] [ 8] /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x4d)[0x15108a40403d]
[rdma-virt-02:219082] [ 9] /lib64/ucx/libuct_ib.so.0(+0x3a48a)[0x15108a41748a]
[rdma-virt-02:219082] [10] /lib64/libucp.so.0(ucp_worker_progress+0x2a)[0x15108ad53ada]
[rdma-virt-02:219082] [11] /usr/lib64/openmpi/lib/libopen-pal.so.40(opal_progress+0x34)[0x1510a07f2f94]
[rdma-virt-02:219082] [12] /usr/lib64/openmpi/lib/libmpi.so.40(ompi_request_default_wait+0x12d)[0x1510a1e9659d]
[rdma-virt-02:219082] [13] /usr/lib64/openmpi/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0x103)[0x1510a1f02643]
[rdma-virt-02:219082] [14] /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Barrier+0xb0)[0x1510a1eadb70]
[rdma-virt-02:219082] [15] mpitests-osu_bw(+0x1fd0)[0x55d7f1079fd0]
[rdma-virt-02:219082] [16] /lib64/libc.so.6(__libc_start_main+0xe5)[0x1510a0f6ad85]
[rdma-virt-02:219082] [17] mpitests-osu_bw(+0x25de)[0x55d7f107a5de]
[rdma-virt-02:219082] *** End of error message ***
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 219082 on node 172.31.45.202 exited on signal 6 (Aborted).
--------------------------------------------------------------------------
+ [22-08-18 10:06:32] RQA_check_result -r 134 -t 'openmpi ucx osu_bw'
+ [22-08-18 10:06:32] local test_pass=0
+ [22-08-18 10:06:32] local test_skip=777
+ [22-08-18 10:06:32] test 4 -gt 0
+ [22-08-18 10:06:32] case $1 in
+ [22-08-18 10:06:32] local rc=134
+ [22-08-18 10:06:32] shift
+ [22-08-18 10:06:32] shift
+ [22-08-18 10:06:32] test 2 -gt 0
+ [22-08-18 10:06:32] case $1 in
+ [22-08-18 10:06:32] local 'msg=openmpi ucx osu_bw'
+ [22-08-18 10:06:32] shift
+ [22-08-18 10:06:32] shift
+ [22-08-18 10:06:32] test 0 -gt 0
+ [22-08-18 10:06:32] '[' -z 134 -o -z 'openmpi ucx osu_bw' ']'
+ [22-08-18 10:06:32] '[' -z /tmp/tmp.LwXAyOokgN/results_ucx-ucx-.txt ']'
+ [22-08-18 10:06:32] '[' -z /tmp/tmp.LwXAyOokgN/results_ucx-ucx-.txt ']'
+ [22-08-18 10:06:32] '[' 134 -eq 0 ']'
+ [22-08-18 10:06:32] '[' 134 -eq 777 ']'
+ [22-08-18 10:06:32] local test_result=FAIL
+ [22-08-18 10:06:32] export result=FAIL
+ [22-08-18 10:06:32] result=FAIL
+ [22-08-18 10:06:32] [[ ! -z '' ]]
+ [22-08-18 10:06:32] printf '%10s | %6s | %s\n' FAIL 134 'openmpi ucx osu_bw'
+ [22-08-18 10:06:32] set +x
------------------------------
TEST RESULT FOR ucx
Test:   openmpi ucx osu_bw
Result: FAIL
Return: 134
------------------------------
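For readability, the RQA_check_result trace above corresponds to roughly the following logic (a reconstruction inferred from the xtrace, not the actual test-library source; the result-file handling is omitted):

RQA_check_result() {
    # Inferred contract: -r <return-code> -t <test-name>;
    # rc 0 => PASS, rc 777 => SKIP, anything else => FAIL.
    local test_pass=0
    local test_skip=777
    local rc msg
    while test $# -gt 0; do
        case $1 in
            -r) rc=$2;  shift 2 ;;
            -t) msg=$2; shift 2 ;;
            *)  shift ;;
        esac
    done
    { [ -z "$rc" ] || [ -z "$msg" ]; } && return 1
    local test_result=FAIL
    [ "$rc" -eq "$test_pass" ] && test_result=PASS
    [ "$rc" -eq "$test_skip" ] && test_result=SKIP
    export result=$test_result
    # Same format string as the trace: right-aligned result, return code, name.
    printf '%10s | %6s | %s\n' "$test_result" "$rc" "$msg"
}

Here osu_bw aborted with SIGABRT, so mpirun's exit status is 134 (128 + 6), which is why the banner reports Return: 134.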
Expected results:
The test completes successfully, printing the full osu_bw bandwidth table with no UCX transport errors.
Additional info:
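The fatal completion in the log, "Transport retry count exceeded" (syndrome 0x15), means the RC QP exhausted its retransmit retries without an acknowledgment from the peer; on a RoCE fabric this usually indicates a path or addressing problem between the hosts (for example a lossy or asymmetric bond path, or the wrong GID being selected) rather than a problem in the benchmark itself. A first triage step is to confirm that sgid_index 7 from the error line resolves to the intended RoCE address and GID type on mlx5_bond_0; a minimal sketch, assuming rdma-core is installed:

# GID value and RoCE type for the index UCX chose (sgid_index=7 in the error line):
cat /sys/class/infiniband/mlx5_bond_0/ports/1/gids/7
cat /sys/class/infiniband/mlx5_bond_0/ports/1/gid_attrs/types/7
# Full device/GID dump for comparison against the peer:
ibv_devinfo -d mlx5_bond_0 -v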