Bug
Resolution: Won't Do
rhel-9.1.0
rhel-sst-network-drivers
ssg_networking
Description of problem:
When tested on all variants of MLX5 ROCE HCAs, the following UCX tests failed:
FAIL | 254 | ucp worker info for a
FAIL | 254 | ucp worker info for r
FAIL | 254 | ucp worker info for t
FAIL | 254 | ucp worker info for m
FAIL | 254 | ucp worker info for ae
FAIL | 254 | ucp worker info for re
FAIL | 254 | ucp worker info for te
FAIL | 254 | ucp worker info for me
FAIL | 254 | ucp worker info for aw
FAIL | 254 | ucp worker info for rw
FAIL | 254 | ucp worker info for tw
FAIL | 254 | ucp worker info for mw
FAIL | 255 | ucx_perftest tag_lat
FAIL | 255 | ucx_perftest tag_bw
FAIL | 255 | ucx_perftest ucp_put_lat
FAIL | 255 | ucx_perftest ucp_put_bw
FAIL | 255 | ucx_perftest ucp_get
FAIL | 135 | openmpi ucx osu_bw
This is a regression when compared with RHEL-9.1.0-20220524.0, as well as with the build for the CTC#2 testing cycle (however, this build was not available during the Beta compose testing cycle).
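For triage, the failure list above can be grouped by its middle column. The following is a minimal sketch, assuming that column is the return code passed to `RQA_check_result -r` and that each line keeps the `FAIL | code | name` layout shown in this report. The function name `group_failures` is hypothetical and not part of the test suite.

```python
from collections import defaultdict


def group_failures(report: str) -> dict[int, list[str]]:
    """Group 'FAIL | <code> | <test name>' lines by their numeric code."""
    groups = defaultdict(list)
    for line in report.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Keep only well-formed FAIL lines; ignore everything else.
        if len(parts) == 3 and parts[0] == "FAIL" and parts[1].isdigit():
            groups[int(parts[1])].append(parts[2])
    return dict(groups)


if __name__ == "__main__":
    # A few lines copied from the failure list in this report.
    sample = (
        "FAIL | 254 | ucp worker info for a\n"
        "FAIL | 255 | ucx_perftest tag_lat\n"
        "FAIL | 135 | openmpi ucx osu_bw\n"
    )
    for code, names in sorted(group_failures(sample).items()):
        print(code, names)
```

Grouping this way makes it easier to see that every `ucp worker info` test shares one code while the `ucx_perftest` and `openmpi` tests report different ones.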
Version-Release number of selected component (if applicable):
Clients: rdma-dev-22
Servers: rdma-dev-21
DISTRO=RHEL-9.1.0-20220910.0
+ [22-09-11 20:02:57] cat /etc/redhat-release
Red Hat Enterprise Linux release 9.1 Beta (Plow)
+ [22-09-11 20:02:57] uname -a
Linux rdma-dev-22.rdma.lab.eng.rdu2.redhat.com 5.14.0-162.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Sep 5 10:44:43 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux
+ [22-09-11 20:02:57] cat /proc/cmdline
BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-162.el9.x86_64 root=UUID=376371e8-0b44-45c2-8687-191dbb3737bc ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=beb6c243-17c9-4210-ba33-d2c0b4062b8a console=ttyS1,115200n81
+ [22-09-11 20:02:57] rpm -q rdma-core linux-firmware
rdma-core-41.0-3.el9.x86_64
linux-firmware-20220708-127.el9.noarch
+ [22-09-11 20:02:57] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_2/fw_ver
==> /sys/class/infiniband/mlx5_0/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_1/fw_ver <==
12.28.2006
==> /sys/class/infiniband/mlx5_2/fw_ver <==
12.28.2006
+ [22-09-11 20:02:57] lspci
+ [22-09-11 20:02:57] grep -i -e ethernet -e infiniband -e omni -e ConnectX
01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
04:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
Installed:
ucx-cma-1.13.0-1.el9.x86_64 ucx-ib-1.13.0-1.el9.x86_64
ucx-rdmacm-1.13.0-1.el9.x86_64
How reproducible:
100%
Steps to Reproduce:
1. Install the above RHEL-9.1.0-20220910.0 build
2. Install & execute kernel-kernel-infiniband-ucx test script
3. Watch ucx result on client side
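The worker-info part of the test can also be re-run by hand. The sketch below assumes that each "ucp worker info for X" test invokes `ucx_info -u X -w` under the same `timeout` wrapper seen in the log for `-u a`; only the `a` invocation is confirmed by this report, and the helper names are hypothetical.

```python
import shutil
import subprocess

# Feature-flag combinations taken from the failed test names in this report.
FEATURE_FLAGS = ["a", "r", "t", "m", "ae", "re", "te", "me", "aw", "rw", "tw", "mw"]


def worker_info_cmd(flags: str) -> list[str]:
    """Build the command line seen in the log, with the given -u value."""
    return ["timeout", "--preserve-status", "--kill-after=5m", "3m",
            "ucx_info", "-u", flags, "-w"]


def run_all() -> dict[str, int]:
    """Run every combination and return the exit status for each one."""
    results = {}
    for flags in FEATURE_FLAGS:
        proc = subprocess.run(worker_info_cmd(flags), capture_output=True, text=True)
        results[flags] = proc.returncode
    return results


if __name__ == "__main__":
    if shutil.which("ucx_info") is None:
        print("ucx_info not found; nothing to run")
    else:
        for flags, status in run_all().items():
            print(f"ucx_info -u {flags} -w -> exit status {status}")
```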
Actual results:
+ [22-09-11 20:03:03] timeout --preserve-status --kill-after=5m 3m ucx_info -u a -w
[1662940983.707711] [rdma-dev-22:259734:0] mm_posix.c:187 UCX ERROR Not enough memory to write total of 4292720 bytes. Please check that /dev/shm or the directory you specified has more available memory.
[1662940983.707961] [rdma-dev-22:259734:0] uct_mem.c:155 UCX ERROR failed to allocate 4292720 bytes using md posix for mm_recv_desc: Out of memory
[1662940983.707966] [rdma-dev-22:259734:0] mpool.c:226 UCX ERROR Failed to allocate memory pool (name=mm_recv_desc) chunk: Out of memory
[1662940983.707969] [rdma-dev-22:259734:0] mm_iface.c:801 UCX ERROR failed to get the first receive descriptor
<Failed to create UCP worker>
+ [22-09-11 20:03:03] RQA_check_result -r 254 -t 'ucp worker info for a'
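The UCX error points at `/dev/shm`, so a first diagnostic step is to compare its free space with the allocation size in the log. This is a minimal sketch, assuming free space on the filesystem is the relevant limit; the allocation could fail for other reasons that this check does not cover. The helper name `has_room` is hypothetical.

```python
import os
import shutil

# Size UCX tried to allocate, taken from the error message in this report.
REQUIRED_BYTES = 4292720


def has_room(path: str, required: int) -> bool:
    """Return True if the filesystem holding `path` has at least `required` bytes free."""
    return shutil.disk_usage(path).free >= required


if __name__ == "__main__":
    shm = "/dev/shm"
    if not os.path.isdir(shm):
        print(f"{shm} does not exist on this system")
    else:
        usage = shutil.disk_usage(shm)
        print(f"{shm}: total={usage.total} used={usage.used} free={usage.free}")
        print("enough room for mm_recv_desc:", has_room(shm, REQUIRED_BYTES))
```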
Expected results:
Results matching those from the RHEL-9.1.0-20220524.0 build:
+ [22-09-11 19:11:54] timeout --preserve-status --kill-after=5m 3m ucx_info -u a -w
#
# UCP worker 'rdma-perf-03:255480'
#
#   address: 743 bytes
#   atomics: 17:dc_mlx5/mlx5_0:1, 18:rc_mlx5/mlx5_0:1
#
#   memory: 18.90MB, file descriptors: 43
#   create time: 156.139 ms
#
+ [22-09-11 19:11:58] RQA_check_result -r 0 -t 'ucp worker info for a'
Additional info: