RHEL / RHEL-6168

RHEL-8.7 ucx test 'openmpi ucx osu_bw' fails

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Undefined
    • Affects Version: rhel-8.7.0
    • Component: ucx
    • Team: rhel-sst-network-drivers
    • Label: ssg_networking
    • Release note: If docs needed, set a value

      Description of problem:
      On RHEL-8.7.0, the 'openmpi ucx osu_bw' case of our ucx test failed on hosts with Mellanox ConnectX-4 Lx (MT27710) adapters (see the lspci output below), as shown in the Actual results section. The failure occurred when running the test over a RoCE fabric (UCX_NET_DEVICES=mlx5_bond_0:1).

      Version-Release number of selected component (if applicable):
      DISTRO=RHEL-8.7.0-20220817.0
      Red Hat Enterprise Linux release 8.7 Beta (Ootpa)
      4.18.0-418.el8.x86_64
      rdma-core-41.0-1.el8.x86_64
      linux-firmware-20220726-110.git150864a4.el8.noarch
      + [22-08-18 10:00:53] tail /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
      ==> /sys/class/infiniband/mlx5_0/fw_ver <==
      12.28.2006

      ==> /sys/class/infiniband/mlx5_1/fw_ver <==
      12.28.2006

      ==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
      14.32.1010
      + [22-08-18 10:00:53] lspci
      + [22-08-18 10:00:53] grep -i -e ethernet -e infiniband -e omni -e ConnectX
      02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      04:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
      04:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
      05:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
      05:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]

      How reproducible:
      Seen only once so far.

      Steps to Reproduce:
      1. Install RHEL-8.7.0-20220817.0 on rdma-virt-02/03
      2. Install and execute the kernel-kernel-infiniband-ucx test script
      3. Watch the ucx result on the client side
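
      The steps above can be sketched as a single client-side script. This is a minimal sketch: the package names are assumptions, while the mpirun invocation mirrors the failing command captured in the Actual results section.

```shell
#!/bin/sh
# Sketch of the reproduction, run on the client (rdma-virt-02).
# Package names below are assumptions; the mpirun command itself is
# copied from the failing run in this report.

# Step 2: install the MPI/UCX test bits (assumed package set).
dnf -y install openmpi mpitests-openmpi ucx

# Step 3: run the osu_bw case over the bonded RoCE port and watch the result.
timeout --preserve-status --kill-after=5m 3m \
    mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root \
        --map-by node -mca btl '^vader,tcp,openib' \
        -mca pml ucx -mca osc ucx \
        -x UCX_NET_DEVICES=mlx5_bond_0:1 \
        mpitests-osu_bw
echo "osu_bw exit status: $?"
```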

      Actual results:
      + [22-08-18 10:06:14] timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl '^vader,tcp,openib' -mca btl_openib_cpc_include rdmacm -mca btl_openib_receive_queues P,65536,256,192,128 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=mlx5_bond_0:1 mpitests-osu_bw

      # OSU MPI Bandwidth Test v5.8
      # Size      Bandwidth (MB/s)
        [rdma-virt-02:219082:0:219082] ib_mlx5_log.c:177 Transport retry count exceeded on mlx5_bond_0:1/RoCE (synd 0x15 vend 0x81 hw_synd 0/0)
        [rdma-virt-02:219082:0:219082] ib_mlx5_log.c:177 RC QP 0x1379 wqe[0]: SEND --e [inl len 10] [rqpn 0x1379 dlid=0 sl=0 port=1 src_path_bits=0 dgid=::ffff:172.31.40.203 sgid_index=7 traffic_class=0]
        ==== backtrace (tid: 219082) ====
        0 /lib64/libucs.so.0(ucs_handle_error+0x2dc) [0x15108a68cedc]
        1 /lib64/libucs.so.0(ucs_fatal_error_message+0xb1) [0x15108a689d41]
        2 /lib64/libucs.so.0(ucs_log_default_handler+0xde4) [0x15108a68e6a4]
        3 /lib64/libucs.so.0(ucs_log_dispatch+0xe4) [0x15108a68e9c4]
        4 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x27a) [0x15108a40259a]
        5 /lib64/ucx/libuct_ib.so.0(+0x3c480) [0x15108a419480]
        6 /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x4d) [0x15108a40403d]
        7 /lib64/ucx/libuct_ib.so.0(+0x3a48a) [0x15108a41748a]
        8 /lib64/libucp.so.0(ucp_worker_progress+0x2a) [0x15108ad53ada]
        9 /usr/lib64/openmpi/lib/libopen-pal.so.40(opal_progress+0x34) [0x1510a07f2f94]
        10 /usr/lib64/openmpi/lib/libmpi.so.40(ompi_request_default_wait+0x12d) [0x1510a1e9659d]
        11 /usr/lib64/openmpi/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0x103) [0x1510a1f02643]
        12 /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Barrier+0xb0) [0x1510a1eadb70]
        13 mpitests-osu_bw(+0x1fd0) [0x55d7f1079fd0]
        14 /lib64/libc.so.6(__libc_start_main+0xe5) [0x1510a0f6ad85]
        15 mpitests-osu_bw(+0x25de) [0x55d7f107a5de]
        =================================
        [rdma-virt-02:219082] *** Process received signal ***
        [rdma-virt-02:219082] Signal: Aborted (6)
        [rdma-virt-02:219082] Signal code: (-6)
        [rdma-virt-02:219082] [ 0] /lib64/libpthread.so.0(+0x12cf0)[0x1510a1308cf0]
        [rdma-virt-02:219082] [ 1] /lib64/libc.so.6(gsignal+0x10f)[0x1510a0f7eaff]
        [rdma-virt-02:219082] [ 2] /lib64/libc.so.6(abort+0x127)[0x1510a0f51ea5]
        [rdma-virt-02:219082] [ 3] /lib64/libucs.so.0(+0x27d46)[0x15108a689d46]
        [rdma-virt-02:219082] [ 4] /lib64/libucs.so.0(ucs_log_default_handler+0xde4)[0x15108a68e6a4]
        [rdma-virt-02:219082] [ 5] /lib64/libucs.so.0(ucs_log_dispatch+0xe4)[0x15108a68e9c4]
        [rdma-virt-02:219082] [ 6] /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_completion_with_err+0x27a)[0x15108a40259a]
        [rdma-virt-02:219082] [ 7] /lib64/ucx/libuct_ib.so.0(+0x3c480)[0x15108a419480]
        [rdma-virt-02:219082] [ 8] /lib64/ucx/libuct_ib.so.0(uct_ib_mlx5_check_completion+0x4d)[0x15108a40403d]
        [rdma-virt-02:219082] [ 9] /lib64/ucx/libuct_ib.so.0(+0x3a48a)[0x15108a41748a]
        [rdma-virt-02:219082] [10] /lib64/libucp.so.0(ucp_worker_progress+0x2a)[0x15108ad53ada]
        [rdma-virt-02:219082] [11] /usr/lib64/openmpi/lib/libopen-pal.so.40(opal_progress+0x34)[0x1510a07f2f94]
        [rdma-virt-02:219082] [12] /usr/lib64/openmpi/lib/libmpi.so.40(ompi_request_default_wait+0x12d)[0x1510a1e9659d]
        [rdma-virt-02:219082] [13] /usr/lib64/openmpi/lib/libmpi.so.40(ompi_coll_base_barrier_intra_recursivedoubling+0x103)[0x1510a1f02643]
        [rdma-virt-02:219082] [14] /usr/lib64/openmpi/lib/libmpi.so.40(MPI_Barrier+0xb0)[0x1510a1eadb70]
        [rdma-virt-02:219082] [15] mpitests-osu_bw(+0x1fd0)[0x55d7f1079fd0]
        [rdma-virt-02:219082] [16] /lib64/libc.so.6(__libc_start_main+0xe5)[0x1510a0f6ad85]
        [rdma-virt-02:219082] [17] mpitests-osu_bw(+0x25de)[0x55d7f107a5de]
        [rdma-virt-02:219082] *** End of error message ***
        --------------------------------------------------------------------------
        Primary job terminated normally, but 1 process returned
        a non-zero exit code. Per user-direction, the job has been aborted.
        --------------------------------------------------------------------------
        --------------------------------------------------------------------------
        mpirun noticed that process rank 1 with PID 219082 on node 172.31.45.202 exited on signal 6 (Aborted).
        --------------------------------------------------------------------------
        + [22-08-18 10:06:32] RQA_check_result -r 134 -t 'openmpi ucx osu_bw'
        + [22-08-18 10:06:32] local test_pass=0
        + [22-08-18 10:06:32] local test_skip=777
        + [22-08-18 10:06:32] test 4 -gt 0
        + [22-08-18 10:06:32] case $1 in
        + [22-08-18 10:06:32] local rc=134
        + [22-08-18 10:06:32] shift
        + [22-08-18 10:06:32] shift
        + [22-08-18 10:06:32] test 2 -gt 0
        + [22-08-18 10:06:32] case $1 in
        + [22-08-18 10:06:32] local 'msg=openmpi ucx osu_bw'
        + [22-08-18 10:06:32] shift
        + [22-08-18 10:06:32] shift
        + [22-08-18 10:06:32] test 0 -gt 0
        + [22-08-18 10:06:32] '[' -z 134 -o -z 'openmpi ucx osu_bw' ']'
        + [22-08-18 10:06:32] '[' -z /tmp/tmp.LwXAyOokgN/results_ucx-ucx-.txt ']'
        + [22-08-18 10:06:32] '[' -z /tmp/tmp.LwXAyOokgN/results_ucx-ucx-.txt ']'
        + [22-08-18 10:06:32] '[' 134 -eq 0 ']'
        + [22-08-18 10:06:32] '[' 134 -eq 777 ']'
        + [22-08-18 10:06:32] local test_result=FAIL
        + [22-08-18 10:06:32] export result=FAIL
        + [22-08-18 10:06:32] result=FAIL
        + [22-08-18 10:06:32] [[ ! -z '' ]]
        + [22-08-18 10:06:32] printf '%10s | %6s | %s\n' FAIL 134 'openmpi ucx osu_bw'
        + [22-08-18 10:06:32] set +x
      • TEST RESULT FOR ucx
      • Test: openmpi ucx osu_bw
      • Result: FAIL
      • Return: 134
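
      For reference, the xtrace above corresponds roughly to the following result-checking logic. This is a minimal sketch reconstructed only from what the trace shows (-r carries the return code, -t the test name, 0 means PASS, 777 means SKIP); the real RQA_check_result also writes the row into a results file, which is omitted here.

```shell
#!/bin/sh
# Sketch of the result check visible in the xtrace: parse -r/-t,
# map rc 0 -> PASS, 777 -> SKIP, anything else -> FAIL, then print
# the verdict as a fixed-width row.
RQA_check_result_sketch() {
    rc='' msg=''
    while test $# -gt 0; do
        case $1 in
            -r) rc=$2; shift ;;
            -t) msg=$2; shift ;;
        esac
        shift
    done
    if [ "$rc" -eq 0 ]; then
        result=PASS
    elif [ "$rc" -eq 777 ]; then
        result=SKIP
    else
        result=FAIL
    fi
    printf '%10s | %6s | %s\n' "$result" "$rc" "$msg"
}

# The failing case from this report prints a FAIL row for rc 134:
RQA_check_result_sketch -r 134 -t 'openmpi ucx osu_bw'
```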

      Expected results:
      The test completes successfully.

      Additional info:

              Assignee: Michal Schmidt <mschmidt@redhat.com>
              Reporter: Afom Michael <tmichael@redhat.com>