RHEL-6137

[RHEL9.2] - Some openmpi benchmarks time out with a return code of 1 when executed on CXGB4 devices


    • rhel-net-drivers
    • ssg_networking

      Description of problem:

      Some of the openmpi benchmarks time out with return code 1 (RC1) when run on CXGB4 devices.
      The failed benchmarks are as follows:

      FAIL | 1 | openmpi IMB-IO P_Write_indv mpirun one_core
      FAIL | 1 | openmpi IMB-IO P_Write_expl mpirun one_core
      FAIL | 1 | openmpi IMB-IO P_Write_shared mpirun one_core
      FAIL | 1 | openmpi IMB-IO P_Write_priv mpirun one_core
      FAIL | 1 | openmpi IMB-IO C_Write_indv mpirun one_core
      FAIL | 1 | openmpi IMB-IO C_Write_expl mpirun one_core
      FAIL | 1 | openmpi IMB-IO C_Write_shared mpirun one_core
      FAIL | 1 | openmpi OSU get_acc_latency mpirun one_core
      FAIL | 1 | openmpi OSU mbw_mr mpirun one_core
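
      The failure list above can be summarized per benchmark suite with a short script. The following is a minimal sketch, not part of the test harness: it assumes each result line has the form "STATUS | RC | test", and the field names used for the test description (mpi, suite, benchmark, launcher, mode) are labels chosen here for illustration.

```python
from collections import defaultdict

# Result lines copied verbatim from the report above.
REPORT = """\
FAIL | 1 | openmpi IMB-IO P_Write_indv mpirun one_core
FAIL | 1 | openmpi IMB-IO P_Write_expl mpirun one_core
FAIL | 1 | openmpi IMB-IO P_Write_shared mpirun one_core
FAIL | 1 | openmpi IMB-IO P_Write_priv mpirun one_core
FAIL | 1 | openmpi IMB-IO C_Write_indv mpirun one_core
FAIL | 1 | openmpi IMB-IO C_Write_expl mpirun one_core
FAIL | 1 | openmpi IMB-IO C_Write_shared mpirun one_core
FAIL | 1 | openmpi OSU get_acc_latency mpirun one_core
FAIL | 1 | openmpi OSU mbw_mr mpirun one_core
"""


def parse_result_line(line):
    """Split one 'STATUS | RC | test' line into a dict.

    The five-word layout of the test description is an assumption
    based on the lines in this report.
    """
    status, rc, test = (field.strip() for field in line.split("|"))
    mpi, suite, benchmark, launcher, mode = test.split()
    return {
        "status": status,
        "rc": int(rc),
        "mpi": mpi,
        "suite": suite,
        "benchmark": benchmark,
        "launcher": launcher,
        "mode": mode,
    }


def failures_by_suite(text):
    """Group the names of failed benchmarks by suite."""
    grouped = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = parse_result_line(line)
        if record["status"] == "FAIL":
            grouped[record["suite"]].append(record["benchmark"])
    return dict(grouped)


if __name__ == "__main__":
    for suite, names in failures_by_suite(REPORT).items():
        print(f"{suite}: {len(names)} failed ({', '.join(names)})")
```

      Grouped this way, seven of the nine failures are IMB-IO write benchmarks and the remaining two are OSU benchmarks.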

      This issue appears to reproduce consistently on the following host pairs:

      a. rdma-qe-12 (cxgb4 T5 iw 40) / rdma-perf-06 (cxgb4 T6 iw 100)

      beaker job : https://beaker.engineering.redhat.com/jobs/7293293

      b. rdma-dev-13 (cxgb4 T6 iw 100) / rdma-perf-06 (cxgb4 T6 iw 100)

      beaker job : https://beaker.engineering.redhat.com/jobs/7292260

      Version-Release number of selected component (if applicable):

      Clients: rdma-perf-06
      Servers: rdma-qe-12

      DISTRO=RHEL-9.2.0-20221129.2

      + [22-11-30 18:38:52] cat /etc/redhat-release
      Red Hat Enterprise Linux release 9.2 Beta (Plow)

      + [22-11-30 18:38:52] uname -a
      Linux rdma-perf-06.rdma.lab.eng.rdu2.redhat.com 5.14.0-202.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Mon Nov 28 08:49:47 EST 2022 x86_64 x86_64 x86_64 GNU/Linux

      + [22-11-30 18:38:52] cat /proc/cmdline
      BOOT_IMAGE=(hd0,gpt2)/vmlinuz-5.14.0-202.el9.x86_64 root=UUID=60790874-ea0a-4a35-8447-d83f2475913b ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=08d83c36-2fab-45c6-a375-8bb16849b90a console=ttyS0,115200n81

      + [22-11-30 18:38:52] rpm -q rdma-core linux-firmware
      rdma-core-41.0-3.el9.x86_64
      linux-firmware-20221012-128.el9.noarch

      + [22-11-30 18:38:52] tail /sys/class/infiniband/cxgb4_0/fw_ver /sys/class/infiniband/hfi1_0/fw_ver /sys/class/infiniband/mlx5_0/fw_ver /sys/class/infiniband/mlx5_1/fw_ver /sys/class/infiniband/qedr0/fw_ver /sys/class/infiniband/qedr1/fw_ver
      ==> /sys/class/infiniband/cxgb4_0/fw_ver <==
      1.27.0.0

      ==> /sys/class/infiniband/hfi1_0/fw_ver <==
      1.27.0

      ==> /sys/class/infiniband/mlx5_0/fw_ver <==
      20.99.5392

      ==> /sys/class/infiniband/mlx5_1/fw_ver <==
      20.99.5392

      ==> /sys/class/infiniband/qedr0/fw_ver <==
      8.59.1.0

      ==> /sys/class/infiniband/qedr1/fw_ver <==
      8.59.1.0
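
      When comparing firmware across hosts, the `tail` output above can be turned into a device-to-version mapping. The following is a minimal sketch that assumes the `==> path <==` header format that `tail` prints for multiple files, as shown in this log; the function name `parse_fw_versions` is hypothetical.

```python
import re

# Output of the `tail .../fw_ver` command, copied from the log above.
TAIL_OUTPUT = """\
==> /sys/class/infiniband/cxgb4_0/fw_ver <==
1.27.0.0

==> /sys/class/infiniband/hfi1_0/fw_ver <==
1.27.0

==> /sys/class/infiniband/mlx5_0/fw_ver <==
20.99.5392

==> /sys/class/infiniband/mlx5_1/fw_ver <==
20.99.5392

==> /sys/class/infiniband/qedr0/fw_ver <==
8.59.1.0

==> /sys/class/infiniband/qedr1/fw_ver <==
8.59.1.0
"""

# Matches the per-file header and captures the RDMA device name.
HEADER = re.compile(r"^==> /sys/class/infiniband/([^/]+)/fw_ver <==$")


def parse_fw_versions(text):
    """Return {device: firmware_version} from multi-file tail output."""
    versions = {}
    device = None
    for raw in text.splitlines():
        line = raw.strip()
        match = HEADER.match(line)
        if match:
            device = match.group(1)
        elif line and device is not None:
            # First non-empty line after a header is the version string.
            versions[device] = line
            device = None
    return versions


if __name__ == "__main__":
    for dev, ver in parse_fw_versions(TAIL_OUTPUT).items():
        print(f"{dev}: {ver}")
```

      The mapping from one host can then be compared with the mapping from another to see whether a firmware difference lines up with the failing host pairs.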

      + [22-11-30 18:38:52] lspci
      + [22-11-30 18:38:52] grep -i -e ethernet -e infiniband -e omni -e ConnectX
      19:00.0 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
      19:00.1 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
      19:00.2 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
      19:00.3 Ethernet controller: QLogic Corp. FastLinQ QL41000 Series 10/25/40/50GbE Controller (rev 02)
      5e:00.0 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
      5e:00.1 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
      5e:00.2 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
      5e:00.3 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
      5e:00.4 Ethernet controller: Chelsio Communications Inc T62100-LP-CR Unified Wire Ethernet Controller
      af:00.0 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
      af:00.1 Infiniband controller: Mellanox Technologies MT28908 Family [ConnectX-6]
      d8:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 11)

      How reproducible:

      100% on the above combinations of RDMA hosts

      Steps to Reproduce:

      1. Please refer to the beaker job outputs on the client hosts mentioned above.

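      Actual results:

      The benchmarks listed above time out and fail with return code 1.

      Expected results:

      All openmpi benchmarks pass, as they do on the host combinations listed under Additional info.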

      Additional info:

      However, with the following CXGB4 host combinations, all openmpi benchmarks passed:

      a. mpi suite over rdma-iw-cxgb pool[ RHEL-9.2.0-20221129.2: rdma-perf-06/07 - mpich2,openmpi ]

      beaker job : https://beaker.engineering.redhat.com/jobs/7291986

      b. mpi suite over rdma-iw-cxgb pool[ RHEL-9.2.0-20221129.2: rdma-dev-13/rdma-qe-12 - mpich2,openmpi ] - J:7293324

      mpi/openmpi test results on rdma-dev-13/rdma-qe-12 & Beaker job J:7293324:
      5.14.0-202.el9.x86_64, rdma-core-41.0-3.el9, cxgb4, iw, T520-CR & cxgb4_0
      Result | Status | Test
      -------------------------------------------------
      Checking for failures and known issues:
      no test failures

      beaker job : https://beaker.engineering.redhat.com/jobs/7293324

              kheib Kamal Heib
              bchae Brian Chae (Inactive)
              Kamal Heib
              infiniband-qe