- Bug
- Resolution: Won't Do
- Undefined
- None
- rhel-8.9.0
- Yes
- None
- 1
- rhel-sst-network-drivers
- ssg_networking
- 1
- False
- None
- Network Drivers 6
- None
- None
- If docs needed, set a value
- Unspecified
- None
Description of problem:
The following five ucx_perftest tests failed with return code 134 and a backtrace when run on MLX5 IB devices.
Results on the client host:
PASS | 0 | ucx_perftest am_lat
PASS | 0 | ucx_perftest put_lat
PASS | 0 | ucx_perftest add_lat
PASS | 0 | ucx_perftest fadd
PASS | 0 | ucx_perftest cswap
PASS | 0 | ucx_perftest am_bw
PASS | 0 | ucx_perftest put_bw
PASS | 0 | ucx_perftest add_mr
FAIL | 134 | ucx_perftest tag_lat <<<=============
FAIL | 134 | ucx_perftest tag_bw <<<=============
FAIL | 134 | ucx_perftest ucp_put_lat <<<=============
FAIL | 134 | ucx_perftest ucp_put_bw <<<=============
FAIL | 134 | ucx_perftest ucp_get <<<=============
This is a regression from RHEL-8.8.0-20230228.22, where all five tests passed.
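As a reading aid for the summary above, the following minimal sketch decodes the return code under the usual shell convention that a status of 128 + N means the process was killed by signal N. The helper names (`decode_status`, `failed_tests`) are hypothetical and not part of the test harness; the sample lines are copied from the summary with the trailing `<<<` markers trimmed.

```python
import signal

# A few summary lines from the client-side results (trailing markers trimmed).
SUMMARY = """\
PASS | 0 | ucx_perftest add_mr
FAIL | 134 | ucx_perftest tag_lat
FAIL | 134 | ucx_perftest ucp_get
"""


def decode_status(status):
    """Describe a shell-style exit status (128 + N means 'killed by signal N')."""
    if status > 128:
        try:
            return f"killed by {signal.Signals(status - 128).name}"
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exited with code {status}"


def failed_tests(summary):
    """Map each FAIL line's test name to a decoded description of its status."""
    failures = {}
    for line in summary.splitlines():
        verdict, code, name = (field.strip() for field in line.split("|"))
        if verdict == "FAIL":
            failures[name] = decode_status(int(code))
    return failures


if __name__ == "__main__":
    for name, reason in failed_tests(SUMMARY).items():
        print(f"{name}: {reason}")
```

On Linux, 134 - 128 = 6 is SIGABRT, which is consistent with the "Aborted" and "dumped core" messages in the client logs.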
Version-Release number of selected component (if applicable):
Clients: rdma-perf-03
Servers: rdma-perf-02
DISTRO=RHEL-8.9.0-20230521.41
+ [23-05-30 07:14:21] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.9 Beta (Ootpa)
+ [23-05-30 07:14:21] uname -a
Linux rdma-perf-03.rdma.lab.eng.rdu2.redhat.com 4.18.0-492.el8.x86_64 #1 SMP Tue May 9 14:50:21 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
+ [23-05-30 07:14:21] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-492.el8.x86_64 root=UUID=d72567dc-2661-4f75-9e4f-6680b3a87cbe ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH intel_idle.max_cstate=0 intremap=no_x2apic_optout processor.max_cstate=0 reboot=acpi crashkernel=auto resume=UUID=8e21362e-f6c8-4ebc-ae45-77ce2e18e4b0 console=ttyS1,115200n81
+ [23-05-30 07:14:21] rpm -q rdma-core linux-firmware
rdma-core-44.0-2.el8.1.x86_64
linux-firmware-20230515-115.gitd1962891.el8.noarch
+ [23-05-30 07:14:21] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
16.33.1048
==> /sys/class/infiniband/mlx5_1/fw_ver <==
16.33.1048
+ [23-05-30 07:14:21] lspci
+ [23-05-30 07:14:21] grep -i -e ethernet -e infiniband -e omni -e ConnectX
03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.2 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
03:00.3 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5719 Gigabit Ethernet PCIe (rev 01)
07:00.0 Infiniband controller: Mellanox Technologies MT27800 Family [ConnectX-5]
07:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5]
Installed:
ucx-cma-1.13.1-2.el8.x86_64 ucx-ib-1.13.1-2.el8.x86_64
ucx-rdmacm-1.13.1-2.el8.x86_64
How reproducible:
100%
Steps to Reproduce:
1. On the server host, run the following ucx_perftest commands:
a. ucx_perftest tag_lat
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1
b. ucx_perftest tag_bw
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1
c. ucx_perftest ucp_put_lat
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1
d. ucx_perftest ucp_put_bw
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1
e. ucx_perftest ucp_get
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1
2. On the client host, run the following ucx_perftest commands:
a. ucx_perftest tag_lat
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1 172.31.0.182
b. ucx_perftest tag_bw
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1 172.31.0.182
c. ucx_perftest ucp_put_lat
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1 172.31.0.182
d. ucx_perftest ucp_put_bw
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1 172.31.0.182
e. ucx_perftest ucp_get
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1 172.31.0.182
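The server and client command lines above differ only in the test name and in the trailing server address. A minimal sketch that generates them is shown here; the function name `perftest_cmd` and the constants are hypothetical, and the address 172.31.0.182 is the server IP used in this report.

```python
# Test names and timeout wrapper taken from the reproduction steps above.
TESTS = ["tag_lat", "tag_bw", "ucp_put_lat", "ucp_put_bw", "ucp_get"]
TIMEOUT = ["timeout", "--preserve-status", "--kill-after=5m", "3m"]


def perftest_cmd(test, server_ip=None):
    """Build the argv for one run; omit server_ip to get the server-side command."""
    cmd = TIMEOUT + ["ucx_perftest", "-t", test, "-c", "1"]
    if server_ip is not None:
        cmd.append(server_ip)
    return cmd


if __name__ == "__main__":
    for test in TESTS:
        print("server:", " ".join(perftest_cmd(test)))
        print("client:", " ".join(perftest_cmd(test, "172.31.0.182")))
```

Each list can be passed to `subprocess.run` as-is; the server-side command has to be started before the matching client-side command.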
Actual results:
On the client host, each of the five tests aborts with the output shown below.
a. ucx_perftest tag_lat
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1 172.31.0.182
+ [23-05-30 07:15:40] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1 172.31.0.182
----------------------------------------------------------------------------------------------
latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[rdma-perf-03:69707:0:69707] ib_mlx5_log.c:177 Local QP operation on mlx5_0:1/IB (synd 0x2 vend 0x68 hw_synd 0/31)
[rdma-perf-03:69707:0:69707] ib_mlx5_log.c:177 RC QP 0x143 wqe[47421]: SEND --e [inl len 18] [rqpn 0x143 dlid=6 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 69707) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x7f4da453eedc]
1 /lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x7f4da453bd41]
2 /lib64/libucs.so.0(ucs_log_default_handler+0xde4) [0x7f4da45406a4]
3 /lib64/libucs.so.0(ucs_log_dispatch+0xe4) [0x7f4da45409c4]
4 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x27a) [0x7f4da2ccb59a]
5 /lib64/ucx/libuct_ib.so.0(+0x3c470) [0x7f4da2ce2470]
6 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x4d) [0x7f4da2ccd02d]
7 /lib64/ucx/libuct_ib.so.0(+0x3a47a) [0x7f4da2ce047a]
8 /lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x7f4da49ebaea]
9 ucx_perftest(+0x76644) [0x55e64ab59644]
10 ucx_perftest(+0x69b99) [0x55e64ab4cb99]
11 ucx_perftest(+0xc801) [0x55e64aaef801]
12 ucx_perftest(+0x6bed) [0x55e64aae9bed]
13 ucx_perftest(+0x6cb4) [0x55e64aae9cb4]
14 ucx_perftest(+0x4db2) [0x55e64aae7db2]
15 /lib64/libc.so.6(__libc_start_main+0xe5) [0x7f4da2f65d85]
16 ucx_perftest(+0x4e5e) [0x55e64aae7e5e]
=================================
timeout: the monitored command dumped core
./runtest.sh: line 170: 69706 Aborted $TMOUT ucx_perftest -t $test $specific_args -c 1 $SERVER_IPV4
+ [23-05-30 07:15:41] RQA_check_result -r 134 -t 'ucx_perftest tag_lat'
b. ucx_perftest tag_bw
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1 172.31.0.182
+ [23-05-30 07:19:17] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1 172.31.0.182
----------------------------------------------------------------------------------------------
overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[rdma-perf-03:69798:0:69798] ib_mlx5_log.c:177 Local QP operation on mlx5_0:1/IB (synd 0x2 vend 0x68 hw_synd 0/31)
[rdma-perf-03:69798:0:69798] ib_mlx5_log.c:177 RC QP 0x14e wqe[34768]: SEND --e [inl len 18] [rqpn 0x14e dlid=6 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 69798) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x7faad9b47edc]
1 /lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x7faad9b44d41]
2 /lib64/libucs.so.0(ucs_log_default_handler+0xde4) [0x7faad9b496a4]
3 /lib64/libucs.so.0(ucs_log_dispatch+0xe4) [0x7faad9b499c4]
4 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x27a) [0x7faad82d459a]
5 /lib64/ucx/libuct_ib.so.0(+0x3c470) [0x7faad82eb470]
6 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x4d) [0x7faad82d602d]
7 /lib64/ucx/libuct_ib.so.0(+0x3a47a) [0x7faad82e947a]
8 /lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x7faad9ff4aea]
9 ucx_perftest(+0x7939a) [0x55fa6c64539a]
10 ucx_perftest(+0x69619) [0x55fa6c635619]
11 ucx_perftest(+0xc801) [0x55fa6c5d8801]
12 ucx_perftest(+0x6bed) [0x55fa6c5d2bed]
13 ucx_perftest(+0x6cb4) [0x55fa6c5d2cb4]
14 ucx_perftest(+0x4db2) [0x55fa6c5d0db2]
15 /lib64/libc.so.6(__libc_start_main+0xe5) [0x7faad856ed85]
16 ucx_perftest(+0x4e5e) [0x55fa6c5d0e5e]
=================================
timeout: the monitored command dumped core
./runtest.sh: line 170: 69797 Aborted $TMOUT ucx_perftest -t $test $specific_args -c 1 $SERVER_IPV4
+ [23-05-30 07:19:17] RQA_check_result -r 134 -t 'ucx_perftest tag_bw'
c. ucx_perftest ucp_put_lat
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1 172.31.0.182
+ [23-05-30 07:21:42] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1 172.31.0.182
----------------------------------------------------------------------------------------------
latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[rdma-perf-03:69885:0:69885] ib_mlx5_log.c:177 Local QP operation on mlx5_0:1/IB (synd 0x2 vend 0x68 hw_synd 0/31)
[rdma-perf-03:69885:0:69885] ib_mlx5_log.c:177 RC QP 0x159 wqe[27651]: RDMA_WRITE --- [rva 0x7fc8d8e12000 rkey 0x182bea] [inl len 8] [rqpn 0x159 dlid=6 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 69885) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x7f684032dedc]
1 /lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x7f684032ad41]
2 /lib64/libucs.so.0(ucs_log_default_handler+0xde4) [0x7f684032f6a4]
3 /lib64/libucs.so.0(ucs_log_dispatch+0xe4) [0x7f684032f9c4]
4 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x27a) [0x7f683eaba59a]
5 /lib64/ucx/libuct_ib.so.0(+0x3c470) [0x7f683ead1470]
6 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x4d) [0x7f683eabc02d]
7 /lib64/ucx/libuct_ib.so.0(+0x3a47a) [0x7f683eacf47a]
8 /lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x7f68407daaea]
9 ucx_perftest(+0x6ae9b) [0x55a849daae9b]
10 ucx_perftest(+0x69418) [0x55a849da9418]
11 ucx_perftest(+0xc801) [0x55a849d4c801]
12 ucx_perftest(+0x6bed) [0x55a849d46bed]
13 ucx_perftest(+0x6cb4) [0x55a849d46cb4]
14 ucx_perftest(+0x4db2) [0x55a849d44db2]
15 /lib64/libc.so.6(__libc_start_main+0xe5) [0x7f683ed54d85]
16 ucx_perftest(+0x4e5e) [0x55a849d44e5e]
=================================
timeout: the monitored command dumped core
./runtest.sh: line 170: 69884 Aborted $TMOUT ucx_perftest -t $test $specific_args -c 1 $SERVER_IPV4
+ [23-05-30 07:21:43] RQA_check_result -r 134 -t 'ucx_perftest ucp_put_lat'
d. ucx_perftest ucp_put_bw
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1 172.31.0.182
+ [23-05-30 07:25:18] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1 172.31.0.182
----------------------------------------------------------------------------------------------
overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[rdma-perf-03:69978:0:69978] ib_mlx5_log.c:177 Local QP operation on mlx5_0:1/IB (synd 0x2 vend 0x68 hw_synd 0/31)
[rdma-perf-03:69978:0:69978] ib_mlx5_log.c:177 RC QP 0x164 wqe[19027]: RDMA_WRITE --- [rva 0x7fd859d77000 rkey 0x182bea] [inl len 8] [rqpn 0x164 dlid=6 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 69978) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x7fc326881edc]
1 /lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x7fc32687ed41]
2 /lib64/libucs.so.0(ucs_log_default_handler+0xde4) [0x7fc3268836a4]
3 /lib64/libucs.so.0(ucs_log_dispatch+0xe4) [0x7fc3268839c4]
4 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x27a) [0x7fc32500e59a]
5 /lib64/ucx/libuct_ib.so.0(+0x3c470) [0x7fc325025470]
6 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x4d) [0x7fc32501002d]
7 /lib64/ucx/libuct_ib.so.0(+0x3a47a) [0x7fc32502347a]
8 /lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x7fc326d2eaea]
9 ucx_perftest(+0x6dd52) [0x55fc98859d52]
10 ucx_perftest(+0x69a98) [0x55fc98855a98]
11 ucx_perftest(+0xc801) [0x55fc987f8801]
12 ucx_perftest(+0x6bed) [0x55fc987f2bed]
13 ucx_perftest(+0x6cb4) [0x55fc987f2cb4]
14 ucx_perftest(+0x4db2) [0x55fc987f0db2]
15 /lib64/libc.so.6(__libc_start_main+0xe5) [0x7fc3252a8d85]
16 ucx_perftest(+0x4e5e) [0x55fc987f0e5e]
=================================
timeout: the monitored command dumped core
./runtest.sh: line 170: 69977 Aborted $TMOUT ucx_perftest -t $test $specific_args -c 1 $SERVER_IPV4
+ [23-05-30 07:25:19] RQA_check_result -r 134 -t 'ucx_perftest ucp_put_bw'
e. ucx_perftest ucp_get
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1 172.31.0.182
+ [23-05-30 07:27:44] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1 172.31.0.182
----------------------------------------------------------------------------------------------
latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[thread 0] 631306 1.525 1.588 1.588 4.80 4.80 629785 629785
[rdma-perf-03:70065:0:70065] ib_mlx5_log.c:177 Local QP operation on mlx5_0:1/IB (synd 0x2 vend 0x68 hw_synd 0/31)
[rdma-perf-03:70065:0:70065] ib_mlx5_log.c:177 RC QP 0x16f wqe[63538]: RDMA_READ s-- [rva 0x7fcc997c2000 rkey 0x182bea] [va 0x7f403c1fd600 len 8 lkey 0x17bf7f] [rqpn 0x16f dlid=6 sl=0 port=1 src_path_bits=0]
==== backtrace (tid: 70065) ====
0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x7f4049fb5edc]
1 /lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x7f4049fb2d41]
2 /lib64/libucs.so.0(ucs_log_default_handler+0xde4) [0x7f4049fb76a4]
3 /lib64/libucs.so.0(ucs_log_dispatch+0xe4) [0x7f4049fb79c4]
4 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x27a) [0x7f404874259a]
5 /lib64/ucx/libuct_ib.so.0(+0x3c470) [0x7f4048759470]
6 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x4d) [0x7f404874402d]
7 /lib64/ucx/libuct_ib.so.0(+0x3a47a) [0x7f404875747a]
8 /lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x7f404a462aea]
9 ucx_perftest(+0x6f459) [0x55edc766e459]
10 ucx_perftest(+0x69ab8) [0x55edc7668ab8]
11 ucx_perftest(+0xc801) [0x55edc760b801]
12 ucx_perftest(+0x6bed) [0x55edc7605bed]
13 ucx_perftest(+0x6cb4) [0x55edc7605cb4]
14 ucx_perftest(+0x4db2) [0x55edc7603db2]
15 /lib64/libc.so.6(__libc_start_main+0xe5) [0x7f40489dcd85]
16 ucx_perftest(+0x4e5e) [0x55edc7603e5e]
=================================
timeout: the monitored command dumped core
./runtest.sh: line 170: 70064 Aborted $TMOUT ucx_perftest -t $test $specific_args -c 1 $SERVER_IPV4
+ [23-05-30 07:27:46] RQA_check_result -r 134 -t 'ucx_perftest ucp_get'
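All five failures report the same completion error on different QPs and work request types. The sketch below extracts those fields from the two `ib_mlx5_log.c` lines of each run so the runs can be compared side by side. The regular expressions and the function name `summarize` are assumptions based only on the log lines in this report, not on a documented UCX log format; the sample lines are excerpts trimmed after the opcode.

```python
import re

# Excerpts from the tag_lat, ucp_put_lat and ucp_get runs above (trimmed after the opcode).
LOG = """\
[rdma-perf-03:69707:0:69707] ib_mlx5_log.c:177 Local QP operation on mlx5_0:1/IB (synd 0x2 vend 0x68 hw_synd 0/31)
[rdma-perf-03:69707:0:69707] ib_mlx5_log.c:177 RC QP 0x143 wqe[47421]: SEND
[rdma-perf-03:69885:0:69885] ib_mlx5_log.c:177 Local QP operation on mlx5_0:1/IB (synd 0x2 vend 0x68 hw_synd 0/31)
[rdma-perf-03:69885:0:69885] ib_mlx5_log.c:177 RC QP 0x159 wqe[27651]: RDMA_WRITE
[rdma-perf-03:70065:0:70065] ib_mlx5_log.c:177 Local QP operation on mlx5_0:1/IB (synd 0x2 vend 0x68 hw_synd 0/31)
[rdma-perf-03:70065:0:70065] ib_mlx5_log.c:177 RC QP 0x16f wqe[63538]: RDMA_READ
"""

# Error summary line: reason text, device/port, syndrome and vendor syndrome.
ERROR_RE = re.compile(
    r"ib_mlx5_log\.c:\d+ (?P<reason>.+) on (?P<port>\S+) "
    r"\(synd (?P<synd>0x[0-9a-f]+) vend (?P<vend>0x[0-9a-f]+)"
)
# Failing work request line: QP type, QP number, WQE index and opcode.
WQE_RE = re.compile(
    r"(?P<qp_type>\w+) QP (?P<qpn>0x[0-9a-f]+) wqe\[(?P<index>\d+)\]: (?P<opcode>\w+)"
)


def summarize(log):
    """Return the set of distinct (reason, synd, vend) triples and the opcode list."""
    reasons = set()
    opcodes = []
    for line in log.splitlines():
        error = ERROR_RE.search(line)
        if error:
            reasons.add((error["reason"], error["synd"], error["vend"]))
        wqe = WQE_RE.search(line)
        if wqe:
            opcodes.append(wqe["opcode"])
    return reasons, opcodes


if __name__ == "__main__":
    reasons, opcodes = summarize(LOG)
    print("distinct errors:", reasons)
    print("failing opcodes:", opcodes)
```

For the excerpts used here, a single distinct error triple comes back for SEND, RDMA_WRITE and RDMA_READ alike, which suggests the failure is not specific to one operation type.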
Expected results:
On RHEL-8.8.0-20230228.22, the same five commands completed with return code 0:
a)
+ [23-03-02 12:51:04] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1 172.31.0.182
----------------------------------------------------------------------------------------------
latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[thread 0] 624935 0.761 0.802 0.802 9.51 9.51 1246798 1246798
Final: 1000000 0.761 0.775 0.792 9.84 9.63 1289887 1262618
+ [23-03-02 12:51:06] RQA_check_result -r 0 -t 'ucx_perftest tag_lat'
b)
+ [23-03-02 12:51:11] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1 172.31.0.182
----------------------------------------------------------------------------------------------
overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
Final: 1000000 0.087 0.188 0.188 40.66 40.66 5329579 5329579
+ [23-03-02 12:51:12] RQA_check_result -r 0 -t 'ucx_perftest tag_bw'
c)
+ [23-03-02 12:51:17] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1 172.31.0.182
----------------------------------------------------------------------------------------------
latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[thread 0] 624215 0.779 0.803 0.803 9.50 9.50 1245421 1245421
Final: 1000000 0.765 0.783 0.796 9.74 9.59 1276819 1257037
+ [23-03-02 12:51:19] RQA_check_result -r 0 -t 'ucx_perftest ucp_put_lat'
d)
+ [23-03-02 12:51:24] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1 172.31.0.182
----------------------------------------------------------------------------------------------
overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
Final: 1000000 0.087 0.202 0.202 37.86 37.86 4962135 4962135
+ [23-03-02 12:51:25] RQA_check_result -r 0 -t 'ucx_perftest ucp_put_bw'
e)
+ [23-03-02 12:51:30] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1 172.31.0.182
----------------------------------------------------------------------------------------------
latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[thread 0] 629297 1.515 1.593 1.593 4.79 4.79 627781 627781
Final: 1000000 1.515 1.534 1.571 4.97 4.86 651869 636500
+ [23-03-02 12:51:32] RQA_check_result -r 0 -t 'ucx_perftest ucp_get'
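To compare a passing run against a later one, the "Final:" rows can be parsed and sanity-checked. The sketch below is a minimal example under stated assumptions: the column order is inferred from the table header (iteration count, then three latency or overhead columns, two bandwidth columns, two message-rate columns), the message size is taken to be 8 bytes (consistent with the "len 8" fields in the failing logs), and MB is taken to mean 2^20 bytes. The names `FinalRow`, `parse_final`, `implied_rate` and `implied_bandwidth` are hypothetical.

```python
from typing import NamedTuple

MESSAGE_BYTES = 8  # assumption: message size, consistent with "len 8" in the logs


class FinalRow(NamedTuple):
    iterations: int
    p50_usec: float
    avg_usec: float
    overall_usec: float
    avg_mbps: float
    overall_mbps: float
    avg_rate: int
    overall_rate: int


def parse_final(line):
    """Parse one 'Final:' row of ucx_perftest output into named fields."""
    fields = line.split()
    if len(fields) != 9 or fields[0] != "Final:":
        raise ValueError(f"not a Final row: {line!r}")
    iterations, p50, avg, overall, avg_bw, overall_bw, avg_rate, overall_rate = fields[1:]
    return FinalRow(int(iterations), float(p50), float(avg), float(overall),
                    float(avg_bw), float(overall_bw), int(avg_rate), int(overall_rate))


def implied_rate(usec_per_message):
    """Message rate (msg/s) implied by a per-message time in microseconds."""
    return 1e6 / usec_per_message


def implied_bandwidth(rate, message_bytes=MESSAGE_BYTES):
    """Bandwidth in MB/s (MB taken as 2**20 bytes) implied by a message rate."""
    return rate * message_bytes / 2**20


if __name__ == "__main__":
    # tag_lat row from the RHEL-8.8.0 run above.
    row = parse_final("Final: 1000000 0.761 0.775 0.792 9.84 9.63 1289887 1262618")
    print(row)
    print("implied rate:", round(implied_rate(row.overall_usec)))
    print("implied bandwidth:", round(implied_bandwidth(row.overall_rate), 2))
```

For the tag_lat row, the rate and bandwidth derived this way agree with the reported overall columns to within rounding, which supports the assumed column order and message size.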
Additional info: