RHEL-6074

[RHEL8.9] most of ucx_perftest tests failed with RC of 143 on MLX5 RoCE devices

    • Bug
    • Resolution: Won't Do
    • rhel-8.9.0
    • ucx
    • rhel-sst-network-drivers
    • ssg_networking
    • Network Drivers 6

      Description of problem:

      When UCX was tested on MLX5 RoCE devices, the following tests failed with a return code of 143.

      FAIL | 143 | ucx_perftest tag_lat
      FAIL | 143 | ucx_perftest tag_bw
      FAIL | 143 | ucx_perftest ucp_put_lat
      FAIL | 143 | ucx_perftest ucp_put_bw
      FAIL | 143 | ucx_perftest ucp_get

      This is a regression from the RHEL-8.8.0-20230228.22 build.
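      A note on the exit code (my interpretation, not from the log): 143 is 128 + SIGTERM(15), meaning the client never completed within its 3-minute budget and was killed by `timeout`. A minimal sketch reproducing that status, using `sleep` as a stand-in for a hung ucx_perftest client:

```shell
# 143 = 128 + 15 (SIGTERM). With --preserve-status, `timeout` reports the
# command's signal-death status rather than its own exit code, so a client
# that hangs past its 3m budget exits with 143.
timeout --preserve-status 1 sleep 10
echo $?   # prints 143
```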

      Version-Release number of selected component (if applicable):

      Clients: rdma-dev-20
      Servers: rdma-dev-19

      DISTRO=RHEL-8.9.0-20230521.41

      + [23-05-25 11:07:48] cat /etc/redhat-release
      Red Hat Enterprise Linux release 8.9 Beta (Ootpa)

      + [23-05-25 11:07:48] uname -a
      Linux rdma-dev-20.rdma.lab.eng.rdu2.redhat.com 4.18.0-492.el8.x86_64 #1 SMP Tue May 9 14:50:21 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux

      + [23-05-25 11:07:48] cat /proc/cmdline
      BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-492.el8.x86_64 root=UUID=ceb5ecf0-f76e-43d9-a805-fb8115e8ca03 ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=318b16c0-dfd8-45e0-be0e-f6400597df19 console=ttyS1,115200n81

      + [23-05-25 11:07:48] rpm -q rdma-core linux-firmware
      rdma-core-44.0-2.el8.1.x86_64
      linux-firmware-20230515-115.gitd1962891.el8.noarch

      + [23-05-25 11:07:48] tail /sys/class/infiniband/mlx5_2/fw_ver /sys/class/infiniband/mlx5_3/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
      ==> /sys/class/infiniband/mlx5_2/fw_ver <==
      12.28.2006

      ==> /sys/class/infiniband/mlx5_3/fw_ver <==
      12.28.2006

      ==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
      14.31.1014

      + [23-05-25 11:07:48] lspci
      + [23-05-25 11:07:48] grep -i -e ethernet -e infiniband -e omni -e ConnectX
      01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      04:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
      04:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
      82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
      82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

      Installed:
      ucx-cma-1.13.1-2.el8.x86_64 ucx-ib-1.13.1-2.el8.x86_64
      ucx-rdmacm-1.13.1-2.el8.x86_64

      How reproducible:
      100%

      Steps to Reproduce:

      Please refer to the following Beaker test job for details.

      https://beaker.engineering.redhat.com/jobs/7883980

      1. On the server side, issue the following commands:

      a) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1

      b) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1

      c) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1

      d) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1

      e) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1

      2. On the client side, issue the following commands:

      a) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1 172.31.45.119

      b) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1 172.31.45.119

      c) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1 172.31.45.119

      d) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1 172.31.45.119

      e) timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1 172.31.45.119

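      The five client invocations above differ only in the test name; a compact sketch that emits them (172.31.45.119 is the server address used in this job):

```shell
# Generate the client-side command lines for the five failing tests.
SERVER_IP=172.31.45.119   # rdma-dev-19 in this beaker job
for t in tag_lat tag_bw ucp_put_lat ucp_put_bw ucp_get; do
  echo "timeout --preserve-status --kill-after=5m 3m ucx_perftest -t $t -c 1 $SERVER_IP"
done
```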

      Actual results:

      a)
      + [23-05-25 11:08:21] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1 172.31.45.119
      ----------------------------------------------------------------------------------------------
                             latency (usec)           bandwidth (MB/s)      message rate (msg/s)
      ----------------------------------------------------------------------------------------------
       Stage   # iterations   50.0%ile average overall   average  overall     average    overall
      ----------------------------------------------------------------------------------------------
      [1685027476.411834] [rdma-dev-20:366583:0] perftest.c:129 UCX ERROR recv() failed: Connection reset by peer
      + [23-05-25 11:11:21] RQA_check_result -r 143 -t 'ucx_perftest tag_lat'

      b)
      + [23-05-25 11:11:26] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1 172.31.45.119
      + [23-05-25 11:14:26] RQA_check_result -r 143 -t 'ucx_perftest tag_bw'

      c)
      timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1 172.31.45.119
      + [23-05-25 11:17:31] RQA_check_result -r 143 -t 'ucx_perftest ucp_put_lat'

      d)
      timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1 172.31.45.119
      ----------------------------------------------------------------------------------------------
                             overhead (usec)          bandwidth (MB/s)      message rate (msg/s)
      ----------------------------------------------------------------------------------------------
       Stage   # iterations   50.0%ile average overall   average  overall     average    overall
      ----------------------------------------------------------------------------------------------
      [1685028032.034652] [rdma-dev-20:366801:0] perftest.c:129 UCX ERROR recv() failed: Connection reset by peer
      + [23-05-25 11:20:37] RQA_check_result -r 143 -t 'ucx_perftest ucp_put_bw'

      e)
      timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1 172.31.45.119
      ----------------------------------------------------------------------------------------------
                             latency (usec)           bandwidth (MB/s)      message rate (msg/s)
      ----------------------------------------------------------------------------------------------
       Stage   # iterations   50.0%ile average overall   average  overall     average    overall
      ----------------------------------------------------------------------------------------------
      [1685028217.200962] [rdma-dev-20:366878:0] perftest.c:129 UCX ERROR recv() failed: Connection reset by peer
      + [23-05-25 11:23:42] RQA_check_result -r 143 -t 'ucx_perftest ucp_get'

      Expected results:

      The following results are from testing the RHEL-8.8.0-20230228.22 build.

      a)
      + [23-03-02 17:17:44] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_lat -c 1 172.31.45.119
      ----------------------------------------------------------------------------------------------
                             latency (usec)           bandwidth (MB/s)      message rate (msg/s)
      ----------------------------------------------------------------------------------------------
       Stage   # iterations   50.0%ile average overall   average  overall     average    overall
      ----------------------------------------------------------------------------------------------
      [thread 0] 598721 0.828 0.835 0.835 9.14 9.14 1197441 1197441
      Final: 1000000 0.827 0.835 0.835 9.14 9.14 1197835 1197599
      + [23-03-02 17:17:46] RQA_check_result -r 0 -t 'ucx_perftest tag_lat'
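      As a sanity check on these numbers (my arithmetic, not part of the log): for the tag_lat run, the message rate should be roughly the reciprocal of the overall latency in microseconds, which matches the ~1,197,599 msg/s in the Final line:

```shell
# ~1e6 / 0.835 usec, consistent with the Final message rate above
awk 'BEGIN { printf "%.0f\n", 1e6 / 0.835 }'   # prints 1197605
```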

      b)

      + [23-03-02 17:17:52] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t tag_bw -c 1 172.31.45.119
      ----------------------------------------------------------------------------------------------
                             overhead (usec)          bandwidth (MB/s)      message rate (msg/s)
      ----------------------------------------------------------------------------------------------
       Stage   # iterations   50.0%ile average overall   average  overall     average    overall
      ----------------------------------------------------------------------------------------------
      Final: 1000000 0.068 0.098 0.098 78.11 78.11 10237401 10237401
      + [23-03-02 17:17:52] RQA_check_result -r 0 -t 'ucx_perftest tag_bw'

      c)

      + [23-03-02 17:17:57] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_lat -c 1 172.31.45.119
      ----------------------------------------------------------------------------------------------
                             latency (usec)           bandwidth (MB/s)      message rate (msg/s)
      ----------------------------------------------------------------------------------------------
       Stage   # iterations   50.0%ile average overall   average  overall     average    overall
      ----------------------------------------------------------------------------------------------
      [thread 0] 593097 0.841 0.843 0.843 9.05 9.05 1186195 1186195
      Final: 1000000 0.841 0.843 0.843 9.05 9.05 1185993 1186113
      + [23-03-02 17:17:59] RQA_check_result -r 0 -t 'ucx_perftest ucp_put_lat'

      d)

      + [23-03-02 17:18:04] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_put_bw -c 1 172.31.45.119
      ----------------------------------------------------------------------------------------------
                             overhead (usec)          bandwidth (MB/s)      message rate (msg/s)
      ----------------------------------------------------------------------------------------------
       Stage   # iterations   50.0%ile average overall   average  overall     average    overall
      ----------------------------------------------------------------------------------------------
      Final: 1000000 0.068 0.094 0.094 80.87 80.87 10599975 10599975
      + [23-03-02 17:18:05] RQA_check_result -r 0 -t 'ucx_perftest ucp_put_bw'

      e)

      + [23-03-02 17:18:10] timeout --preserve-status --kill-after=5m 3m ucx_perftest -t ucp_get -c 1 172.31.45.119
      ----------------------------------------------------------------------------------------------
                             latency (usec)           bandwidth (MB/s)      message rate (msg/s)
      ----------------------------------------------------------------------------------------------
       Stage   # iterations   50.0%ile average overall   average  overall     average    overall
      ----------------------------------------------------------------------------------------------
      [thread 0] 610706 1.614 1.637 1.637 4.66 4.66 610706 610706
      Final: 1000000 1.602 1.638 1.638 4.66 4.66 610552 610646
      + [23-03-02 17:18:12] RQA_check_result -r 0 -t 'ucx_perftest ucp_get'

      Additional info:
