  1. RHEL
  2. RHEL-6171

[RHEL9.1] UCX fails in many tests when tested on MLX5 ROCE / IB devices

    • Bug
    • Resolution: Won't Do
    • rhel-9.1.0
    • ucx
    • rhel-sst-network-drivers
    • ssg_networking

      Description of problem:

      When tested on all variants of MLX5 ROCE HCAs, the following UCX tests failed:

      FAIL | 254 | ucp worker info for a
      FAIL | 254 | ucp worker info for r
      FAIL | 254 | ucp worker info for t
      FAIL | 254 | ucp worker info for m
      FAIL | 254 | ucp worker info for ae
      FAIL | 254 | ucp worker info for re
      FAIL | 254 | ucp worker info for te
      FAIL | 254 | ucp worker info for me
      FAIL | 254 | ucp worker info for aw
      FAIL | 254 | ucp worker info for rw
      FAIL | 254 | ucp worker info for tw
      FAIL | 254 | ucp worker info for mw
      FAIL | 255 | ucx_perftest tag_lat
      FAIL | 255 | ucx_perftest tag_bw
      FAIL | 255 | ucx_perftest ucp_put_lat
      FAIL | 255 | ucx_perftest ucp_put_bw
      FAIL | 255 | ucx_perftest ucp_get
      FAIL | 135 | openmpi ucx osu_bw
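The second column appears to be the return code that the harness passes to `RQA_check_result -r`. As a reading aid, here is a minimal sketch that interprets such codes under the usual shell convention that 128+N means "terminated by signal N". The helper name `decode_status` is hypothetical and not part of the test suite, and the report itself does not say how the harness derives these values. Under that convention 135 would be signal 7, which is SIGBUS on Linux x86_64.

```shell
#!/usr/bin/env bash
# Hypothetical helper: interpret a return code, assuming the common
# shell convention that 128+N means "terminated by signal N".
decode_status() {
    local status=$1 signum name
    # Linux signal numbers run from 1 to 64, so only 129..192 are decoded.
    if [ "$status" -gt 128 ] && [ "$status" -le 192 ]; then
        signum=$((status - 128))
        name=$(kill -l "$signum" 2>/dev/null) || name="unknown"
        echo "signal $signum ($name)"
    else
        echo "exit code $status"
    fi
}

# The three distinct return codes seen in the FAIL list above.
for s in 254 255 135; do
    printf '%s -> %s\n' "$s" "$(decode_status "$s")"
done
```

Codes 254 and 255 fall outside the signal range and are reported as plain exit codes.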

      This is a regression compared with RHEL-9.1.0-20220524.0, as well as with the build for the CTC#2 testing cycle (however, this build was not available during the Beta compose testing cycle).

      Version-Release number of selected component (if applicable):

      Clients: rdma-dev-22
      Servers: rdma-dev-21

      DISTRO=RHEL-9.1.0-20220910.0

      + [22-09-11 20:02:57] cat /etc/redhat-release
      Red Hat Enterprise Linux release 9.1 Beta (Plow)

      + [22-09-11 20:02:57] uname -a
      Linux rdma-dev-22.rdma.lab.eng.rdu2.redhat.com 5.14.0-162.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Sep 5 10:44:43 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

      + [22-09-11 20:02:57] cat /proc/cmdline
      BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-162.el9.x86_64 root=UUID=376371e8-0b44-45c2-8687-191dbb3737bc ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=beb6c243-17c9-4210-ba33-d2c0b4062b8a console=ttyS1,115200n81

      + [22-09-11 20:02:57] rpm -q rdma-core linux-firmware
      rdma-core-41.0-3.el9.x86_64
      linux-firmware-20220708-127.el9.noarch

      + [22-09-11 20:02:57] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_2/fw_ver
      ==> /sys/class/infiniband/mlx5_0/fw_ver <==
      12.28.2006

      ==> /sys/class/infiniband/mlx5_1/fw_ver <==
      12.28.2006

      ==> /sys/class/infiniband/mlx5_2/fw_ver <==
      12.28.2006
      + [22-09-11 20:02:57] lspci
      + [22-09-11 20:02:57] grep -i -e ethernet -e infiniband -e omni -e ConnectX
      01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      04:00.0 Ethernet controller: Mellanox Technologies MT27700 Family [ConnectX-4]
      82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
      82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

      Installed:
      ucx-cma-1.13.0-1.el9.x86_64 ucx-ib-1.13.0-1.el9.x86_64
      ucx-rdmacm-1.13.0-1.el9.x86_64

      How reproducible:
      100%

      Steps to Reproduce:
      1. Install the above RHEL-9.1.0-20220910.0 build
      2. Install & execute kernel-kernel-infiniband-ucx test script
      3. Watch ucx result on client side
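For a quicker check without the full test script, the worker-info part of the failures can be sketched as a loop over the feature strings named in the FAIL list. This is an assumption-laden sketch: only `ucx_info -u a -w` appears verbatim in this report, the other `-u` values are inferred from the test names, and the function names `ucx_feature_sets` and `run_worker_info_checks` are hypothetical.

```shell
#!/usr/bin/env bash
# Print the 12 feature strings in the order of the FAIL list:
# a r t m, then the same with suffix "e", then with suffix "w".
ucx_feature_sets() {
    local suffix feature
    for suffix in "" e w; do
        for feature in a r t m; do
            echo "${feature}${suffix}"
        done
    done
}

# Run ucx_info for each feature string, with the same timeout wrapper
# as in the report, and print the return code of each run.
run_worker_info_checks() {
    local features rc
    for features in $(ucx_feature_sets); do
        timeout --preserve-status --kill-after=5m 3m \
            ucx_info -u "$features" -w >/dev/null 2>&1
        rc=$?
        echo "ucp worker info for $features: rc=$rc"
    done
}

if command -v ucx_info >/dev/null 2>&1; then
    run_worker_info_checks
else
    echo "ucx_info not found; install the ucx package first" >&2
fi
```

A return code of 0 for every line would match the behavior reported for the RHEL-9.1.0-20220524.0 build.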

      Actual results:

      + [22-09-11 20:03:03] timeout --preserve-status --kill-after=5m 3m ucx_info -u a -w
      [1662940983.707711] [rdma-dev-22:259734:0] mm_posix.c:187 UCX ERROR Not enough memory to write total of 4292720 bytes. Please check that /dev/shm or the directory you specified has more available memory.
      [1662940983.707961] [rdma-dev-22:259734:0] uct_mem.c:155 UCX ERROR failed to allocate 4292720 bytes using md posix for mm_recv_desc: Out of memory
      [1662940983.707966] [rdma-dev-22:259734:0] mpool.c:226 UCX ERROR Failed to allocate memory pool (name=mm_recv_desc) chunk: Out of memory
      [1662940983.707969] [rdma-dev-22:259734:0] mm_iface.c:801 UCX ERROR failed to get the first receive descriptor
      <Failed to create UCP worker>
      + [22-09-11 20:03:03] RQA_check_result -r 254 -t 'ucp worker info for a'
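The first error line asks the user to check that /dev/shm has enough available memory. A minimal sketch of that check follows, comparing the space `df` reports for /dev/shm with the 4292720 bytes from the log. The function names `shm_available_bytes` and `has_room` are hypothetical, and this check was not part of the original test run.

```shell
#!/usr/bin/env bash
# Size of the failed allocation, taken from the UCX error message.
required_bytes=4292720

# Print the available space of a directory's filesystem in bytes.
# Prints 0 if the directory does not exist or df gives no usable value.
shm_available_bytes() {
    local dir=${1:-/dev/shm} kib
    if [ ! -d "$dir" ]; then
        echo 0
        return
    fi
    # df -Pk: POSIX output format, 1024-byte blocks; column 4 is "Available".
    kib=$(df -Pk "$dir" 2>/dev/null | awk 'NR == 2 { print $4 }')
    case "$kib" in
        ''|*[!0-9]*) kib=0 ;;
    esac
    echo $((kib * 1024))
}

# has_room AVAILABLE REQUIRED: succeed if AVAILABLE >= REQUIRED.
has_room() {
    [ "$1" -ge "$2" ]
}

avail=$(shm_available_bytes /dev/shm)
if has_room "$avail" "$required_bytes"; then
    echo "/dev/shm: $avail bytes available, enough for $required_bytes"
else
    echo "/dev/shm: only $avail bytes available, need $required_bytes"
fi
```

If the check reports too little space, listing the contents of /dev/shm on the client may show what is occupying it.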

      Expected results:

      Results from the RHEL-9.1.0-20220524.0 build:

      + [22-09-11 19:11:54] timeout --preserve-status --kill-after=5m 3m ucx_info -u a -w
      #
      # UCP worker 'rdma-perf-03:255480'
      #
      # address: 743 bytes
      # atomics: 17:dc_mlx5/mlx5_0:1, 18:rc_mlx5/mlx5_0:1
      #
      # memory: 18.90MB, file descriptors: 43
      # create time: 156.139 ms
      #
      + [22-09-11 19:11:58] RQA_check_result -r 0 -t 'ucp worker info for a'

      Additional info:

              mschmidt@redhat.com Michal Schmidt
              bchae Brian Chae
              Michal Schmidt Michal Schmidt
              Afom Michael Afom Michael