RHEL-9396

ucx-1.5.2 failure when using more than 256 logical CPUs

    • Bug
    • Resolution: Won't Do
    • rhel-7.9.z
    • ucx
    • Normal
    • sst_network_drivers
    • ssg_networking
    • If docs needed, set a value

      Description of problem:

      When using ucx-1.5.2 with Intel MPI 2019.9 over InfiniBand on a system with more than 256 logical CPUs, MPI initialization fails with a fatal error.

      The customer was provided an engineering build of ucx-1.7.0, and with that build the issue does not occur.

      Version-Release number of selected component (if applicable):

      ucx-1.5.2 has the issue.
      ucx-1.7.0 does not.
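
      A minimal sketch of commands to confirm the environment before reproducing (assuming the stock RHEL 7 ucx package; adjust if the engineering build was installed from another location):

      # Count logical CPUs; the failure is only reported above 256.
      nproc
      lscpu | grep '^CPU(s):'
      # Confirm which ucx package and library version are actually installed.
      rpm -q ucx
      ucx_info -v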

      How reproducible:

      From the case:

      """
      using 260 cores fails - I attached a strace of this call:
      (0) [user@iclcj110 ~]$ np=260; mpirun -np $np /jobman/tmp1/fkf_transfer/impi-2019.9.304/intel64/bin/IMB-MPI1 -npmin $np Sendrecv
      [0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)
      [0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
      [0] MPI startup(): library kind: release
      [0] MPI startup(): libfabric version: 1.10.1-impi
      [0] MPI startup(): libfabric provider: mlx
      [1619101757.193516] [iclcj110:19268:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
      [1619101757.193530] [iclcj110:19438:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
      [1619101757.193537] [iclcj110:19493:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
      [1619101757.193517] [iclcj110:19501:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
      [1619101757.193533] [iclcj110:19267:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
      Abort(1091215) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
      MPIR_Init_thread(136)........:
      MPID_Init(1149)..............:
      MPIDI_OFI_mpi_init_hook(1657): OFI get address vector map failed
      Abort(1091215) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
      MPIR_Init_thread(136)........:
      MPID_Init(1149)..............:
      MPIDI_OFI_mpi_init_hook(1657): OFI get address vector map failed
      [1619101757.193590] [iclcj110:19279:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
      Abort(1091215) on node 13 (rank 13 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
      """

      When using ucx-1.7.0, the issue does not occur.

      """
      (0) [user@node1 ~]$ np=260; mpirun -np $np /jobman/tmp1/fkf_transfer/impi-2019.9.304/intel64/bin/IMB-MPI1 -npmin $np Sendrecv
      [0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)
      [0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
      [0] MPI startup(): library kind: release
      [0] MPI startup(): libfabric version: 1.10.1-impi
      [0] MPI startup(): libfabric provider: mlx
      #------------------------------------------------------------
      #    Intel(R) MPI Benchmarks 2019 Update 7, MPI-1 part
      #------------------------------------------------------------
      # Date                  : Thu Apr 22 16:46:23 2021
      # Machine               : x86_64
      # System                : Linux
      # Release               : 3.10.0-1160.11.1.el7.x86_64
      # Version               : #1 SMP Mon Nov 30 13:05:31 EST 2020
      # MPI Version           : 3.1
      # MPI Thread Environment:

      # Calling sequence was:

      # /jobman/tmp1/fkf_transfer/impi-2019.9.304/intel64/bin/IMB-MPI1 -npmin 260 Sendrecv

      # Minimum message length in bytes:   0
      # Maximum message length in bytes:   4194304
      #
      # MPI_Datatype                   :   MPI_BYTE
      # MPI_Datatype for reductions    :   MPI_FLOAT
      # MPI_Op                         :   MPI_SUM
      #
      """

      Steps to Reproduce:
      1. Use a system with more than 256 logical CPUs.
      2. Install ucx-1.5.2.
      3. Run an Intel MPI 2019.9 job over InfiniBand with more than 256 ranks (see the consolidated command below).
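
      A consolidated form of the reproduction command from the case, using the same Intel MPI 2019.9 install path and IMB Sendrecv benchmark shown in the logs above:

      # Fails in PMPI_Init_thread with ucx-1.5.2 once np exceeds 256;
      # completes normally with the ucx-1.7.0 engineering build.
      np=260
      mpirun -np $np /jobman/tmp1/fkf_transfer/impi-2019.9.304/intel64/bin/IMB-MPI1 -npmin $np Sendrecv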

      Actual results:

      Every rank fails in PMPI_Init_thread with "UCX ERROR no active messages transport" and "MPIDI_OFI_mpi_init_hook(1657): OFI get address vector map failed", as shown in the log above.

      Questions:

      1) Is ucx-1.7.0 going to be released officially for RHEL 7?

      2) Or is it preferable to investigate and fix the bug in ucx-1.5.2?

      3) How can I provide the strace output? It is 92 MiB, which is too large to attach to the issue directly (one way to compress and split it is sketched below).
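
      For question 3, one possible approach (only a sketch, not a statement of the required support process) is to compress the capture and, if needed, split it into pieces small enough to attach. The file name strace.out is a placeholder for the customer's capture:

      # Compress the 92 MiB capture; strace text usually compresses well.
      xz -9 strace.out                      # produces strace.out.xz
      # If still too large, split into 10 MiB pieces that can be attached
      # individually and reassembled with: cat strace.out.xz.part* > strace.out.xz
      split -b 10M strace.out.xz strace.out.xz.part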

      Expected results:

      MPI initialization succeeds and the IMB-MPI1 Sendrecv benchmark runs to completion, as it does with the ucx-1.7.0 engineering build.

      Additional info:

      The customer provided a PowerPoint presentation with more information.
      They also provided the strace output captured when the failure occurs, but it is 92 MiB.

            kheib Kamal Heib
            rhn-support-soakley Lucas Oakley
            Afom Michael