RHEL-6187

[RHEL8.7] OSU acc_latency fails when openmpi benchmarks run on QEDR ROCE device


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Undefined
    • None
    • rhel-8.7.0
    • openmpi
    • None
    • None
    • 1
    • rhel-net-drivers
    • ssg_networking
    • 1
    • False
    • False
    • None
    • Network Drivers 6
    • None
    • None
    • If docs needed, set a value
    • None
    • 57,005

      Description of problem:

      The OSU acc_latency benchmark fails with the following error messages:

      rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
      rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
      rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)

      Version-Release number of selected component (if applicable):

      Clients: rdma-dev-02
      Servers: rdma-perf-06

      DISTRO=RHEL-8.7.0-20220524.0

      + [22-05-26 02:08:38] cat /etc/redhat-release
      Red Hat Enterprise Linux release 8.7 Beta (Ootpa)

      + [22-05-26 02:08:38] uname -a
      Linux rdma-dev-02.rdma.lab.eng.rdu2.redhat.com 4.18.0-393.el8.x86_64 #1 SMP Wed May 18 12:44:50 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

      + [22-05-26 02:08:38] cat /proc/cmdline
      BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-393.el8.x86_64 root=UUID=fd7a6a9d-cd42-4b62-9933-1f5f3d4c927b ro console=tty0 rd_NO_PLYMOUTH intel_iommu=on iommu=on crashkernel=auto resume=UUID=9ea769dc-0bb3-455f-a1b3-d99cd5d33215 console=ttyS1,115200

      + [22-05-26 02:08:38] rpm -q rdma-core linux-firmware
      rdma-core-37.2-1.el8.x86_64
      linux-firmware-20220210-107.git6342082c.el8.noarch

      + [22-05-26 02:08:38] tail /sys/class/infiniband/qedr0/fw_ver /sys/class/infiniband/qedr1/fw_ver
      ==> /sys/class/infiniband/qedr0/fw_ver <==
      8. 59. 1. 0

      ==> /sys/class/infiniband/qedr1/fw_ver <==
      8. 59. 1. 0
      + [22-05-26 02:08:38] lspci
      + [22-05-26 02:08:38] grep -i -e ethernet -e infiniband -e omni -e ConnectX
      02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      08:00.0 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)
      08:00.1 Ethernet controller: QLogic Corp. FastLinQ QL45000 Series 25GbE Controller (rev 10)

      Installed:
      mpitests-openmpi-5.8-1.el8.x86_64 openmpi-1:4.1.1-3.el8.x86_64
      openmpi-devel-1:4.1.1-3.el8.x86_64

      How reproducible:
      100%

      Steps to Reproduce:
      1. Use the above build on hosts with a qedr RoCE device.
      2. Set up both the RDMA server and the client for openmpi.
      3. On the client side, run the following benchmark command:

      timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 --allow-run-as-root --map-by node -mca btl_openib_warn_nonexistent_if 0 -mca btl_openib_if_include qedr0:1 -mca mtl '^psm2,psm,ofi' -mca btl '^openib' --mca mtl_base_verbose 100 --mca btl_openib_verbose 100 -mca pml ucx -mca osc ucx -x UCX_NET_DEVICES=qede_roce.45 --mca osc_ucx_verbose 100 --mca pml_ucx_verbose 100 /usr/lib64/openmpi/bin/mpitests-osu_acc_latency

      Actual results:

      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
      rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
      rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
      rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
      rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
      rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
      rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
      rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to create UD QP on qedr0
      rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0: PSM3 can't open nic unit: 0 (err=23)
      rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:rank0.mpitests-osu_acc_latency: Unable to initialize verbs
      [rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:289 mca_pml_ucx_init
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
      [rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:114 Pack remote worker address, size 38
      [rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:114 Pack local worker address, size 141
      [rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:351 created ucp context 0x56170ef84000, worker 0x56170efd7e50
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
      [create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
      [rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.11.2
      [rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:289 mca_pml_ucx_init
      [rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:114 Pack remote worker address, size 38
      [rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:114 Pack local worker address, size 141
      [rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:351 created ucp context 0x55e45dfd7160, worker 0x55e45e524ca0
      [rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:182 Got proc 0 address, size 141
      [rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:411 connecting to proc. 0
      [rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:182 Got proc 1 address, size 141
      [rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:411 connecting to proc. 1

      # OSU MPI_Accumulate latency Test v5.8
      # Window creation: MPI_Win_allocate
      # Synchronization: MPI_Win_flush
      # Size Latency (us)
      [rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:182 Got proc 0 address, size 38
      [rdma-perf-06.rdma.lab.eng.rdu2.redhat.com:85543] pml_ucx.c:411 connecting to proc. 0
      [rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:182 Got proc 1 address, size 38
      [rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:411 connecting to proc. 1
      1 2570.11
      2 2570.11
      4 2570.11
      8 2570.11
      16 2570.18
      32 2570.10
      + [22-05-26 02:41:36] __MPI_check_result 1 mpitests-openmpi OSU /usr/lib64/openmpi/bin/mpitests-osu_acc_latency mpirun /root/hfile_one_core
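
      The create_qp failures above carry two different codes, 22 and 95. If they are Linux errno values, which is an assumption since the log does not say so, they correspond to EINVAL and EOPNOTSUPP. A minimal sketch for tallying them from a captured log follows; the function names and the hardcoded name table are illustrative and not part of any test harness used here.

```python
import re
from collections import Counter

# Linux errno names for the two codes seen in this log (assumed to be errno values).
# Hardcoded so the sketch gives the same answer on any platform.
LINUX_ERRNO_NAMES = {22: "EINVAL", 95: "EOPNOTSUPP"}

CREATE_QP_RE = re.compile(r"failed on ibv_cmd_create_qp with (\d+)")


def tally_create_qp_errors(log_text):
    """Count create_qp failures per error code found in the log text."""
    return Counter(int(m.group(1)) for m in CREATE_QP_RE.finditer(log_text))


def describe(counts):
    """Render one summary line per error code, in ascending code order."""
    lines = []
    for code, n in sorted(counts.items()):
        name = LINUX_ERRNO_NAMES.get(code, "unknown")
        lines.append(f"errno {code} ({name}): {n}")
    return lines


# A few lines copied from the output above.
sample = """\
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 22
[create_qp:2753]create qp: failed on ibv_cmd_create_qp with 95
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:289 mca_pml_ucx_init
"""

for line in describe(tally_create_qp_errors(sample)):
    print(line)
```

      Run against the full log, this gives a quick view of how many QP creations were rejected as invalid versus unsupported.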

      Expected results:

      Normal execution with proper stats output
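
      To compare a run against this expectation, the size/latency rows can be separated from the interleaved UCX debug lines. The sketch below is one way to do that, assuming each result row is a message size followed by a decimal latency in microseconds, as in the output above; `parse_osu_rows` is a hypothetical helper name.

```python
import re

# A result row is "<size> <latency>", e.g. "16 2570.18"; anything else is ignored.
ROW_RE = re.compile(r"^\s*(\d+)\s+(\d+\.\d+)\s*$")


def parse_osu_rows(output):
    """Return (message_size_bytes, latency_us) tuples found in the benchmark output."""
    rows = []
    for line in output.splitlines():
        m = ROW_RE.match(line)
        if m:
            rows.append((int(m.group(1)), float(m.group(2))))
    return rows


# Lines copied from the actual results above.
sample = """\
# Size Latency (us)
[rdma-dev-02.rdma.lab.eng.rdu2.redhat.com:71426] pml_ucx.c:411 connecting to proc. 1
1 2570.11
16 2570.18
"""

for size, latency in parse_osu_rows(sample):
    print(size, latency)
```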

      Additional info:

              kheib Kamal Heib
              bchae Brian Chae (Inactive)
              Kamal Heib Kamal Heib
              infiniband-qe infiniband-qe infiniband-qe infiniband-qe