RHEL-6159

[RHEL9.1] all mvapich2 benchmarks timeout when run with "mpirun_rsh" command


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Undefined
    • Affects Version: rhel-9.1.0
    • Component: mvapich2
    • Assigned Team: rhel-net-drivers
    • Sub-System Group: ssg_networking
    • Docs: If docs needed, set a value

      Description of problem:

All mvapich2 benchmarks fail with timeouts when run with the "mpirun_rsh" command on the HFI OPA device, while all benchmarks run successfully when launched with the "mpirun" command. This is a REGRESSION from the RHEL-9.1.0-20220524.0 build, where no such failures were observed.
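
      For reference, a minimal sketch of the two launcher invocations being compared. The hostfile path, PSM2_PKEY value, and benchmark are taken from the reproducer below; the exact flags for the Hydra-based mpirun are an assumption, not copied from the test run:

      # Fails: mpirun_rsh passes environment variables as VAR=value arguments
      mpirun_rsh -np 2 -hostfile /root/hfile_one_core PSM2_PKEY=0x8001 mpitests-IMB-MPI1 PingPong -time 1.5

      # Works per this report: Hydra-based mpirun, env variable passed via -genv (assumed flag spelling)
      mpirun -np 2 -f /root/hfile_one_core -genv PSM2_PKEY 0x8001 mpitests-IMB-MPI1 PingPong -time 1.5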

      Version-Release number of selected component (if applicable):

      Clients: rdma-qe-15
      Servers: rdma-qe-14

      DISTRO=RHEL-9.1.0-20220718.0

      + [22-07-20 23:40:51] cat /etc/redhat-release
      Red Hat Enterprise Linux release 9.1 Beta (Plow)

      + [22-07-20 23:40:51] uname -a
      Linux rdma-qe-15.rdma.lab.eng.rdu2.redhat.com 5.14.0-130.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Fri Jul 15 08:52:03 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

      + [22-07-20 23:40:51] cat /proc/cmdline
      BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-130.el9.x86_64 root=UUID=abd4cfd1-c58e-4c29-836b-d0f6ee695d89 ro intel_idle.max_cstate=0 processor.max_cstate=0 console=tty0 rd_NO_PLYMOUTH intel_iommu=on crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=UUID=95f34601-5d88-4b38-bff8-afc96012d612 console=ttyS1,115200

      + [22-07-20 23:40:51] rpm -q rdma-core linux-firmware
      rdma-core-37.2-1.el9.x86_64
      linux-firmware-20220509-126.el9.noarch

      + [22-07-20 23:40:51] tail /sys/class/infiniband/hfi1_0/fw_ver
      1.27.0
      + [22-07-20 23:40:51] lspci
      + [22-07-20 23:40:51] grep -i -e ethernet -e infiniband -e omni -e ConnectX
      02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      04:00.0 Fabric controller: Intel Corporation Omni-Path HFI Silicon 100 Series [discrete] (rev 10)

      Installed:
      mpitests-mvapich2-psm2-5.8-1.el9.x86_64 mvapich2-psm2-2.3.6-3.el9.x86_64

      How reproducible:

      100%

      Steps to Reproduce:
      1. Bring up the RDMA hosts mentioned above with the RHEL9.1 build
      2. Set up the RDMA hosts for mvapich2 benchmark tests
      3. Run one of the mvapich2 benchmarks with the "mpirun_rsh" command, as follows:

      timeout --preserve-status --kill-after=5m 3m mpirun_rsh -np 2 -hostfile /root/hfile_one_core PSM2_PKEY=0x8001 mpitests-IMB-MPI1 PingPong -time 1.5
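
      The contents of /root/hfile_one_core are not included in this report. As a hypothetical illustration only, an mpirun_rsh hostfile lists one hostname per line, e.g.:

      # Assumed contents of a two-host file such as /root/hfile_one_core
      rdma-qe-14.rdma.lab.eng.rdu2.redhat.com
      rdma-qe-15.rdma.lab.eng.rdu2.redhat.com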

      Actual results:

      [rdma-qe-14.rdma.lab.eng.rdu2.redhat.com:mpi_rank_0][mv2_psm_err_handler] PSM error handler: Operation timed out : Detected connection timeout: rdma-qe-15
      psm_ep_connect failed with error Operation timed out
      [rdma-qe-15.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][mv2_psm_err_handler] PSM error handler: Operation timed out : Detected connection timeout: LID=6:7.0
      psm_ep_connect failed with error Operation timed out
      [rdma-qe-14.rdma.lab.eng.rdu2.redhat.com:mpi_rank_0][psm_connect_alltoall] psm_connect_alltoall failed
      [rdma-qe-15.rdma.lab.eng.rdu2.redhat.com:mpi_rank_1][psm_connect_alltoall] psm_connect_alltoall failed
      [rdma-qe-15.rdma.lab.eng.rdu2.redhat.com:mpirun_rsh][signal_processor] Caught signal 15, killing job
      [rdma-qe-15.rdma.lab.eng.rdu2.redhat.com:mpirun_rsh][signal_processor] Caught signal 15, killing job
      + [22-07-21 03:20:02] __MPI_check_result 1 mpitests-mvapich2-psm2 IMB-MPI1 PingPong mpirun_rsh /root/hfile_one_core

      Expected results:

      Normal execution of the benchmarks with stats output
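
      As a hypothetical illustration of the expected stats output, IMB-MPI1 PingPong prints a latency/bandwidth table of roughly this shape (values below are placeholders, not measured results):

      #---------------------------------------------------
      # Benchmarking PingPong
      # #processes = 2
      #---------------------------------------------------
             #bytes #repetitions      t[usec]   Mbytes/sec
                  0         1000         x.xx         0.00
                  1         1000         x.xx         x.xx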

      Additional info:

This same issue exists on the DISTRO=RHEL-9.1.0-20220710.3 build as well.

              Assignee: Kamal Heib (kheib)
              Reporter: Brian Chae (bchae) (Inactive)
              QA Contact: infiniband-qe