-
Bug
-
Resolution: Won't Do
-
Undefined
-
None
-
rhel-8.9.0
-
Yes
-
None
-
1
-
rhel-sst-network-drivers
-
ssg_networking
-
1
-
False
-
-
None
-
Network Drivers 6
-
None
-
None
-
-
Unspecified
-
None
Description of problem:
When UCX was tested on MLX5 RoCE, the following tests failed with return code 143 (128 + 15, i.e. ucx_perftest was terminated by SIGTERM when the 3-minute timeout expired).
FAIL | 143 | ucx_perftest tag_lat
FAIL | 143 | ucx_perftest tag_bw
FAIL | 143 | ucx_perftest ucp_put_lat
FAIL | 143 | ucx_perftest ucp_put_bw
FAIL | 143 | ucx_perftest ucp_get
This is a regression from the RHEL-8.8.0-20230228.22 build.
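For reference, the return code can be decoded as follows. With --preserve-status, timeout exits with the status of the command it ran, and a shell status above 128 conventionally means "terminated by signal (status - 128)". This is a minimal sketch; the helper name decode_status is hypothetical and is not part of the test harness.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Beaker job): decode a shell exit status.
# Statuses above 128 conventionally mean "terminated by signal (status - 128)".
decode_status() {
    ds_status=$1
    if [ "$ds_status" -gt 128 ]; then
        ds_signum=$((ds_status - 128))
        # "kill -l <number>" prints the signal name without the SIG prefix.
        echo "killed by SIG$(kill -l "$ds_signum") (signal $ds_signum)"
    else
        echo "exited with status $ds_status"
    fi
}

decode_status 143
```

A status of 143 therefore points at the 3-minute timeout sending SIGTERM to a hung ucx_perftest rather than at ucx_perftest returning an error of its own.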
Version-Release number of selected component (if applicable):
Clients: rdma-dev-20
Servers: rdma-dev-19
DISTRO=RHEL-8.9.0-20230521.41
+ [23-05-25 11:07:48] cat /etc/redhat-release
Red Hat Enterprise Linux release 8.9 Beta (Ootpa)
+ [23-05-25 11:07:48] uname -a
Linux rdma-dev-20.rdma.lab.eng.rdu2.redhat.com 4.18.0-492.el8.x86_64 #1 SMP Tue May 9 14:50:21 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
+ [23-05-25 11:07:48] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-492.el8.x86_64 root=UUID=ceb5ecf0-f76e-43d9-a805-fb8115e8ca03 ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=318b16c0-dfd8-45e0-be0e-f6400597df19 console=ttyS1,115200n81
+ [23-05-25 11:07:48] rpm -q rdma-core linux-firmware
rdma-core-44.0-2.el8.1.x86_64
linux-firmware-20230515-115.gitd1962891.el8.noarch
+ [23-05-25 11:07:48] tail /sys/class/infiniband/mlx5_2/fw_ver /sys/class/infiniband/mlx5_3/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_3/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
14.31.1014
+ [23-05-25 11:07:48] lspci
+ [23-05-25 11:07:48] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
04:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Installed:
ucx-cma-1.13.1-2.el8.x86_64 ucx-ib-1.13.1-2.el8.x86_64
ucx-rdmacm-1.13.1-2.el8.x86_64
How reproducible:
100%
Steps to Reproduce:
Please refer to the following Beaker test job for details.
https://beaker.engineering.redhat.com/jobs/7883980
1. On the server side, issue the following commands:
a) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1
b) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1
c) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1
d) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1
e) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1
2. On the client side, issue the following commands:
a) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1 172.31.45.119
b) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1 172.31.45.119
c) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1 172.31.45.119
d) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1 172.31.45.119
e) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1 172.31.45.119
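The five client invocations in step 2 differ only in the -t argument, so they can be driven from one loop. The following is a minimal sketch, not the script used by the Beaker job: the function name run_client_tests and the SERVER and PERFTEST variables are hypothetical, and the PASS/FAIL line format only imitates the summary shown in the description.

```shell
#!/bin/sh
# Hypothetical wrapper around the client-side commands from step 2.
# SERVER and PERFTEST are assumptions added for illustration; the timeout
# options and ucx_perftest arguments are copied from the steps above.
run_client_tests() {
    rc_server=${SERVER:-172.31.45.119}
    rc_perftest=${PERFTEST:-ucx_perftest}
    for rc_test in tag_lat tag_bw ucp_put_lat ucp_put_bw ucp_get; do
        timeout --preserve-status --kill-after=5m 3m \
            "$rc_perftest" -t "$rc_test" -c 1 "$rc_server"
        rc_status=$?
        # Report one summary line per test, keyed on the exit status.
        if [ "$rc_status" -eq 0 ]; then
            echo "PASS | $rc_status | ucx_perftest $rc_test"
        else
            echo "FAIL | $rc_status | ucx_perftest $rc_test"
        fi
    done
}
```

Calling run_client_tests on the client, with the matching server-side command already running for each test, should print one summary line per test. On the affected build each line would be expected to show FAIL with status 143.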
Actual results:
a)
+ [23-05-25 11:08:21] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1 172.31.45.119
----------------------------------------------------------------------------------------------
latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[1685027476.411834] [rdma-dev-20:366583:0] perftest.c:129 UCX ERROR recv() failed: Connection reset by peer
+ [23-05-25 11:11:21] RQA_check_result -r 143 -t 'ucx_perftest tag_lat'
b)
+ [23-05-25 11:11:26] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1 172.31.45.119
+ [23-05-25 11:14:26] RQA_check_result -r 143 -t 'ucx_perftest tag_bw'
c)
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1 172.31.45.119
+ [23-05-25 11:17:31] RQA_check_result -r 143 -t 'ucx_perftest ucp_put_lat'
d)
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1 172.31.45.119
----------------------------------------------------------------------------------------------
overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[1685028032.034652] [rdma-dev-20:366801:0] perftest.c:129 UCX ERROR recv() failed: Connection reset by peer
+ [23-05-25 11:20:37] RQA_check_result -r 143 -t 'ucx_perftest ucp_put_bw'
e)
timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1 172.31.45.119
----------------------------------------------------------------------------------------------
latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[1685028217.200962] [rdma-dev-20:366878:0] perftest.c:129 UCX ERROR recv() failed: Connection reset by peer
+ [23-05-25 11:23:42] RQA_check_result -r 143 -t 'ucx_perftest ucp_get'
Expected results:
The following are from RHEL-8.8.0-20230228.22 build testing.
a)
+ [23-03-02 17:17:44] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1 172.31.45.119
----------------------------------------------------------------------------------------------
latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[thread 0] 598721 0.828 0.835 0.835 9.14 9.14 1197441 1197441
Final: 1000000 0.827 0.835 0.835 9.14 9.14 1197835 1197599
+ [23-03-02 17:17:46] RQA_check_result -r 0 -t 'ucx_perftest tag_lat'
b)
+ [23-03-02 17:17:52] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1 172.31.45.119
----------------------------------------------------------------------------------------------
overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
Final: 1000000 0.068 0.098 0.098 78.11 78.11 10237401 10237401
+ [23-03-02 17:17:52] RQA_check_result -r 0 -t 'ucx_perftest tag_bw'
c)
+ [23-03-02 17:17:57] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1 172.31.45.119
----------------------------------------------------------------------------------------------
latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[thread 0] 593097 0.841 0.843 0.843 9.05 9.05 1186195 1186195
Final: 1000000 0.841 0.843 0.843 9.05 9.05 1185993 1186113
+ [23-03-02 17:17:59] RQA_check_result -r 0 -t 'ucx_perftest ucp_put_lat'
d)
+ [23-03-02 17:18:04] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1 172.31.45.119
----------------------------------------------------------------------------------------------
overhead (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
Final: 1000000 0.068 0.094 0.094 80.87 80.87 10599975 10599975
+ [23-03-02 17:18:05] RQA_check_result -r 0 -t 'ucx_perftest ucp_put_bw'
e)
+ [23-03-02 17:18:10] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1 172.31.45.119
----------------------------------------------------------------------------------------------
latency (usec) | bandwidth (MB/s) | message rate (msg/s) |
----------------------------------------------------------------------------------
Stage | # iterations | 50.0%ile | average | overall | average | overall | average | overall |
----------------------------------------------------------------------------------
[thread 0] 610706 1.614 1.637 1.637 4.66 4.66 610706 610706
Final: 1000000 1.602 1.638 1.638 4.66 4.66 610552 610646
+ [23-03-02 17:18:12] RQA_check_result -r 0 -t 'ucx_perftest ucp_get'
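A passing run always ends with a "Final:" row, while the failing runs above print only the table header and the recv() error. A saved log can therefore be checked for the row and its overall figures extracted. This is a minimal sketch, assuming the column order shown in the tables above; the function name summarize_final and the output field names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read ucx_perftest output on stdin and summarize the
# "Final:" row. Column order is assumed from the tables above:
# $2 iterations, $3 50.0%ile (usec), $5 overall (usec),
# $7 overall bandwidth (MB/s), $9 overall message rate (msg/s).
summarize_final() {
    awk '
        $1 == "Final:" {
            printf "iterations=%s p50_usec=%s overall_usec=%s overall_MBps=%s overall_msgps=%s\n",
                   $2, $3, $5, $7, $9
            found = 1
        }
        # Exit non-zero when no Final row was seen (for example, a hung run).
        END { exit !found }
    '
}

echo "Final: 1000000 0.827 0.835 0.835 9.14 9.14 1197835 1197599" | summarize_final
```

The non-zero exit status for logs without a "Final:" row lets the same helper flag the failing runs without relying on the return code from timeout.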
Additional info: