RHEL-6189

[RHEL9.1] All OSU micro-benchmarks fail with "Error in init phase" when run with the "mpirun_rsh" command on a BCM57508 device


    • rhel-net-drivers
    • ssg_networking
    • 57,005

      Description of problem:

      The following OSU benchmarks fail with "Error in init phase" on the BCM57508 device when they are launched with "mpirun_rsh".

      FAIL | 1 | mvapich2 OSU acc_latency mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU allgather mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU allgatherv mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU allreduce mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU alltoall mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU alltoallv mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU barrier mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU bcast mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU bibw mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU bw mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU cas_latency mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU fop_latency mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU gather mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU gatherv mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU get_acc_latency mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU get_bw mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU get_latency mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU hello mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU iallgather mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU iallgatherv mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU iallreduce mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU ialltoall mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU ialltoallv mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU ialltoallw mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU ibarrier mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU ibcast mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU igather mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU igatherv mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU init mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU ireduce mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU iscatter mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU iscatterv mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU latency mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU latency_mp mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU mbw_mr mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU multi_lat mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU put_bibw mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU put_bw mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU put_latency mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU reduce mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU reduce_scatter mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU scatter mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU scatterv mpirun_rsh one_core
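
When triaging a summary like the one above, it can help to reduce it to the bare benchmark names. The following is a minimal sketch, assuming the "STATUS | count | mvapich2 OSU <name> <launcher> <hostfile tag>" layout shown in the list; the function name `failed_benchmarks` is hypothetical and not part of the test suite.

```shell
#!/bin/sh
# Hypothetical helper: read a results summary on stdin and print the
# benchmark name from every FAIL line.
# Assumed layout: STATUS | count | mvapich2 OSU <name> <launcher> <hostfile tag>
failed_benchmarks() {
    awk -F'|' '
        $1 ~ /^[[:space:]]*FAIL[[:space:]]*$/ {
            # Third field looks like " mvapich2 OSU <name> mpirun_rsh one_core".
            n = split($3, w, " ")
            if (n >= 3 && w[2] == "OSU") print w[3]
        }'
}

# Example using three lines copied from the list above.
failed_benchmarks <<'EOF'
      FAIL | 1 | mvapich2 OSU acc_latency mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU allgather mpirun_rsh one_core
      FAIL | 1 | mvapich2 OSU allgatherv mpirun_rsh one_core
EOF
```

Piping the output through `wc -l` gives the number of failing benchmarks.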

      Version-Release number of selected component (if applicable):

      Clients: rdma-dev-26
      Servers: rdma-dev-25

      DISTRO=RHEL-9.1.0-20220509.3

      + [22-05-10 09:57:56] cat /etc/redhat-release
      Red Hat Enterprise Linux release 9.1 Beta (Plow)

      + [22-05-10 09:57:56] uname -a
      Linux rdma-dev-26.rdma.lab.eng.rdu2.redhat.com 5.14.0-86.el9.x86_64 #1 SMP PREEMPT_DYNAMIC Fri May 6 09:23:00 EDT 2022 x86_64 x86_64 x86_64 GNU/Linux

      + [22-05-10 09:57:56] cat /proc/cmdline
      BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-86.el9.x86_64 root=/dev/mapper/rhel_rdma-dev26-root ro intel_idle.max_cstate=0 intremap=no_x2apic_optout processor.max_cstate=0 console=tty0 rd_NO_PLYMOUTH crashkernel=1G-4G:192M,4G-64G:256M,64G:512M resume=/dev/mapper/rhel_rdma-dev-26-swap rd.lvm.lv=rhel_rdma-dev-26/root rd.lvm.lv=rhel_rdma-dev-26/swap console=ttyS1,115200n81

      + [22-05-10 09:57:56] rpm -q rdma-core linux-firmware
      rdma-core-37.2-1.el9.x86_64
      linux-firmware-20220209-126.el9_0.noarch

      + [22-05-10 09:57:56] tail /sys/class/infiniband/bnxt_re0/fw_ver /sys/class/infiniband/bnxt_re1/fw_ver
      ==> /sys/class/infiniband/bnxt_re0/fw_ver <==
      219.0.112.0

      ==> /sys/class/infiniband/bnxt_re1/fw_ver <==
      219.0.112.0

      + [22-05-10 09:57:56] lspci
      + [22-05-10 09:57:56] grep -i -e ethernet -e infiniband -e omni -e ConnectX
      02:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      02:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      03:00.1 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5720 Gigabit Ethernet PCIe
      04:00.0 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)
      04:00.1 Ethernet controller: Broadcom Inc. and subsidiaries BCM57508 NetXtreme-E 10Gb/25Gb/40Gb/50Gb/100Gb/200Gb Ethernet (rev 11)

      Installed:
      mpitests-mvapich2-5.8-1.el9.x86_64 mvapich2-2.3.6-3.el9.x86_64

      How reproducible:

      100%

      Steps to Reproduce:
      1. Bring up the RDMA hosts mentioned above with the RHEL-9.1 build (DISTRO=RHEL-9.1.0-20220509.3)
      2. Set up the RDMA hosts for the mvapich2 benchmark tests
      3. Run one of the mvapich2 benchmarks with the "mpirun_rsh" command, as follows:

      timeout --preserve-status --kill-after=5m 3m mpirun_rsh -hostfile /root/hfile_one_core -np 2 /usr/lib64/mvapich2/bin/mpitests-osu_allgatherv
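
To repeat the command above for more than one benchmark, a small wrapper may be convenient. This is only a sketch: the `run_osu` function and the `DRY_RUN`, `HOSTFILE`, and `BINDIR` variables are hypothetical, and it assumes the other benchmark binaries follow the same `mpitests-osu_<name>` naming as `mpitests-osu_allgatherv`. With `DRY_RUN=1` (the default here) each command is printed rather than executed.

```shell
#!/bin/sh
# Sketch only: re-run selected OSU benchmarks with the command from step 3.
# HOSTFILE, BINDIR and DRY_RUN are hypothetical knobs for this example.
HOSTFILE=${HOSTFILE:-/root/hfile_one_core}
BINDIR=${BINDIR:-/usr/lib64/mvapich2/bin}
DRY_RUN=${DRY_RUN:-1}

run_osu() {
    # $1 is the benchmark name without the "mpitests-osu_" prefix.
    # Assumption: every benchmark binary is named mpitests-osu_<name>.
    set -- timeout --preserve-status --kill-after=5m 3m \
        mpirun_rsh -hostfile "$HOSTFILE" -np 2 "$BINDIR/mpitests-osu_$1"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"      # print the command instead of running it
    else
        "$@"           # run it; the exit status is that of timeout/mpirun_rsh
    fi
}

for bench in allgatherv latency bw; do
    run_osu "$bench"
done
```

Setting `DRY_RUN=0` on a configured RDMA host executes the commands.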

      Actual results:

      [rdma-dev-26.rdma.lab.eng.rdu2.redhat.com:mpirun_rsh][child_handler] Error in init phase, aborting! (0/2 mpispawn connections)
      + [22-05-10 12:29:27] __MPI_check_result 1 mpitests-mvapich2 OSU /usr/lib64/mvapich2/bin/mpitests-osu_allgatherv mpirun_rsh /root/hfile_one_core
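
When scanning captured test logs for this failure, the message above can serve as a signature. The following is a minimal sketch; the function name `has_init_phase_error` is hypothetical, and the sample text is the line reported above.

```shell
#!/bin/sh
# Hypothetical check: exit 0 if the text on stdin contains the mpirun_rsh
# init-phase failure message reported in this issue.
has_init_phase_error() {
    grep -qF 'Error in init phase, aborting!'
}

sample='[rdma-dev-26.rdma.lab.eng.rdu2.redhat.com:mpirun_rsh][child_handler] Error in init phase, aborting! (0/2 mpispawn connections)'
if printf '%s\n' "$sample" | has_init_phase_error; then
    echo "mpirun_rsh init-phase failure detected"
fi
```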

      Expected results:

      Normal execution of the benchmarks with stats output

      Additional info:

              kheib Kamal Heib
              bchae Brian Chae (Inactive)
              Kamal Heib Kamal Heib
              infiniband-qe infiniband-qe infiniband-qe infiniband-qe