RHEL / RHEL-6128

[RHEL8.8] all mvapich2 benchmarks fail when run on MLX5 IB0 or IB1 on MT27700 CX-4



      +++ This bug was initially created as a clone of Bug #2148553 +++

      Description of problem:

      All mvapich2 benchmarks fail with return code 134 when launched with the "mpirun" command, or with return code 1 when launched with the "mpirun_rsh" command (134 is 128 + 6, i.e. the SIGABRT reported in the logs below). This happens on a host with an MT27700 CX-4 device when the transport is IB0 or IB1.

      However, this occurs specifically on the rdma-dev-19 / rdma-dev-20 host pair, running as RDMA server and client, respectively.

      This is a REGRESSION from RHEL-8.7.0, where all mvapich2 benchmarks PASSED on IB0 on the same HCA on rdma-dev-19 / rdma-dev-20.

      Version-Release number of selected component (if applicable):

      Clients: rdma-dev-20
      Servers: rdma-dev-19

      DISTRO=RHEL-8.8.0-20221120.2

      + [22-11-25 16:18:29] cat /etc/redhat-release
      Red Hat Enterprise Linux release 8.8 Beta (Ootpa)

      + [22-11-25 16:18:29] uname -a
      Linux rdma-dev-20.rdma.lab.eng.rdu2.redhat.com 4.18.0-438.el8.x86_64 #1 SMP Mon Nov 14 13:08:07 EST 2022 x86_64 x86_64 x86_64 GNU/Linux

      + [22-11-25 16:18:29] cat /proc/cmdline
      BOOT_IMAGE=(hd0,msdos1)/vmlinuz-4.18.0-438.el8.x86_64 root=UUID=4dcc79ce-c280-4af4-9b75-02011855b115 ro intel_idle.max_cstate=0 processor.max_cstate=0 intel_iommu=on iommu=on console=tty0 rd_NO_PLYMOUTH crashkernel=auto resume=UUID=1c9d8b9c-d969-417d-ad02-b9e6279dfac8 console=ttyS1,115200n81

      + [22-11-25 16:18:29] rpm -q rdma-core linux-firmware
      rdma-core-41.0-1.el8.x86_64
      linux-firmware-20220726-110.git150864a4.el8.noarch

      + [22-11-25 16:18:29] tail /sys/class/infiniband/mlx5_2/fw_ver /sys/class/infiniband/mlx5_3/fw_ver /sys/class/infiniband/mlx5_bond_0/fw_ver
      ==> /sys/class/infiniband/mlx5_2/fw_ver <==
      12.28.2006

      ==> /sys/class/infiniband/mlx5_3/fw_ver <==
      12.28.2006

      ==> /sys/class/infiniband/mlx5_bond_0/fw_ver <==
      14.31.1014
      + [22-11-25 16:18:29] lspci
      + [22-11-25 16:18:29] grep -i -e ethernet -e infiniband -e omni -e ConnectX
      01:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      01:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      04:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
      04:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx]
      82:00.0 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]
      82:00.1 Infiniband controller: Mellanox Technologies MT27700 Family [ConnectX-4]

      Installed:
      mpitests-mvapich2-5.8-1.el8.x86_64 mvapich2-2.3.6-1.el8.x86_64
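
      For reference, a minimal sketch of installing the same test packages on a RHEL 8 host, assuming mvapich2 and mpitests-mvapich2 are available from the configured repositories (the module name below is the one the RHEL mvapich2 package normally ships, so treat it as an assumption):

      dnf install -y mvapich2 mpitests-mvapich2
      # Put mpirun/mpirun_rsh and the MPI libraries on PATH via environment-modules (module name assumed).
      module load mpi/mvapich2-x86_64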

      How reproducible:

      100%

      Steps to Reproduce:
      1. bring up the RDMA hosts mentioned above with RHEL8.8 build
      2. set up the RDMA hosts for the mvapich2 benchmark tests
      3. run one of the mvapich2 benchmarks with the "mpirun" or "mpirun_rsh" command, as follows (a host file sketch is included after the outputs below):

      a) mpirun command

      timeout --preserve-status --kill-after=5m 3m mpirun -hostfile /root/hfile_one_core -np 2 mpitests-IMB-MPI1 PingPong -time 1.5

      *** buffer overflow detected ***: terminated
      [rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][error_sighandler] Caught error: Aborted (signal 6)

      ===================================================================================
      = BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
      = PID 48458 RUNNING AT 172.31.0.120
      = EXIT CODE: 134
      = CLEANING UP REMAINING PROCESSES
      = YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
      ===================================================================================
      [proxy:0:0@rdma-dev-19.rdma.lab.eng.rdu2.redhat.com] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:911): assert (!closed) failed
      [proxy:0:0@rdma-dev-19.rdma.lab.eng.rdu2.redhat.com] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
      [proxy:0:0@rdma-dev-19.rdma.lab.eng.rdu2.redhat.com] main (pm/pmiserv/pmip.c:202): demux engine error waiting for event
      YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Aborted (signal 6)
      This typically refers to a problem with your application.
      Please see the FAQ page for debugging suggestions

      b) "mpirun_rsh" command

      + [22-11-25 14:26:27] timeout --preserve-status --kill-after=5m 3m mpirun_rsh -np 2 -hostfile /root/hfile_one_core mpitests-IMB-MPI1 PingPong -time 1.5

      *** buffer overflow detected ***: terminated
      [rdma-dev-19.rdma.lab.eng.rdu2.redhat.com:mpi_rank_0][error_sighandler] Caught error: Aborted (signal 6)
      *** buffer overflow detected ***: terminated
      [rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][error_sighandler] Caught error: Aborted (signal 6)
      [rdma-dev-19.rdma.lab.eng.rdu2.redhat.com:mpispawn_0][readline] Unexpected End-Of-File on file descriptor 5. MPI process died?
      [rdma-dev-19.rdma.lab.eng.rdu2.redhat.com:mpispawn_0][mtpmi_processops] Error while reading PMI socket. MPI process died?
      [rdma-dev-19.rdma.lab.eng.rdu2.redhat.com:mpispawn_0][child_handler] MPI process (rank: 0, pid: 51624) terminated with signal 6 -> abort job
      [rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpispawn_1][readline] Unexpected End-Of-File on file descriptor 6. MPI process died?
      [rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpispawn_1][mtpmi_processops] Error while reading PMI socket. MPI process died?
      [rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpispawn_1][child_handler] MPI process (rank: 1, pid: 52467) terminated with signal 6 -> abort job
      [rdma-dev-20.rdma.lab.eng.rdu2.redhat.com:mpirun_rsh][process_mpispawn_connection] mpispawn_0 from node 172.31.0.119 aborted: Error while reading a PMI socket (4)
      + [22-11-25 14:26:30] __MPI_check_result 1 mpitests-mvapich2 IMB-MPI1 PingPong mpirun_rsh /root/hfile_one_core
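
      For reference, a minimal sketch of the host file used in both commands above; the exact contents of /root/hfile_one_core are an assumption (MVAPICH2 host files list one hostname per line, here one rank on each host):

      # /root/hfile_one_core -- contents assumed, one rank per host
      rdma-dev-19.rdma.lab.eng.rdu2.redhat.com
      rdma-dev-20.rdma.lab.eng.rdu2.redhat.com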

      Actual results:

      Both launchers abort with "*** buffer overflow detected ***: terminated"; the MPI ranks are killed by signal 6 (SIGABRT), mpirun reports exit code 134, and mpirun_rsh returns 1. No benchmark statistics are produced.

      Expected results:

      Normal run with stats

      Additional info:

      On other host pairs with the same MT27700 CX-4 device, such as rdma-dev-21 / rdma-dev-22, all mvapich2 benchmarks PASSED on IB0. Likewise, on the rdma-perf-02 / rdma-perf-03 host pair with an mlx5 MT27800 CX-5 on ib0, all mvapich2 benchmarks PASSED.
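
      One way to narrow down whether the failure follows the IB port or the host pair is to confirm the CX-4 link state and pin the run to a single device. The sketch below is hedged: MV2_IBA_HCA is a standard MVAPICH2 runtime parameter, and mlx5_2 is taken from the fw_ver listing above as one of the CX-4 ports, so adjust the device name to match the port under test:

      # Expect State: Active / Physical state: LinkUp on the port under test.
      ibstat mlx5_2
      # mpirun_rsh passes KEY=VALUE settings placed before the executable down to the MPI ranks.
      timeout --preserve-status --kill-after=5m 3m mpirun_rsh -np 2 -hostfile /root/hfile_one_core MV2_IBA_HCA=mlx5_2 mpitests-IMB-MPI1 PingPong -time 1.5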
