Bug
Resolution: Won't Do
Target release: rhel-7.9.z
Severity: Moderate
Team: rhel-sst-network-drivers (ssg_networking)
Architecture: x86_64
Description of problem:
When using ucx-1.5.2 with Intel MPI 2019.9 over InfiniBand on a system with more than 256 logical CPUs, MPI startup fails with a fatal error.
The customer was provided an engineering build of ucx-1.7.0, with which the issue does not occur.
Version-Release number of selected component (if applicable):
ucx-1.5.2 has the issue.
ucx-1.7.0 does not.
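A quick way to confirm which UCX build is actually in use on the affected node (a sketch; it assumes the RHEL package is named ucx and that ucx_info from that package is on PATH):

# Installed UCX package version
rpm -q ucx
# Version and build configuration reported by the UCX library itself
ucx_info -v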
How reproducible:
From the case:
"""
using 260 cores fails - I attached a strace of this call:
(0) [user@iclcj110 ~]$ np=260; mpirun -np $np /jobman/tmp1/fkf_transfer/impi-2019.9.304/intel64/bin/IMB-MPI1 -npmin $np Sendrecv
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1-impi
[0] MPI startup(): libfabric provider: mlx
[1619101757.193516] [iclcj110:19268:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
[1619101757.193530] [iclcj110:19438:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
[1619101757.193537] [iclcj110:19493:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
[1619101757.193517] [iclcj110:19501:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
[1619101757.193533] [iclcj110:19267:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
Abort(1091215) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(1149)..............:
MPIDI_OFI_mpi_init_hook(1657): OFI get address vector map failed
Abort(1091215) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(136)........:
MPID_Init(1149)..............:
MPIDI_OFI_mpi_init_hook(1657): OFI get address vector map failed
[1619101757.193590] [iclcj110:19279:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
Abort(1091215) on node 13 (rank 13 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
"""
When using ucx-1.7.0, the issue does not occur.
"""
(0) [user@node1 ~]$ np=260; mpirun -np $np /jobman/tmp1/fkf_transfer/impi-2019.9.304/intel64/bin/IMB-MPI1 -npmin $np Sendrecv
[0] MPI startup(): Intel(R) MPI Library, Version 2019 Update 9 Build 20200923 (id: abd58e492)
[0] MPI startup(): Copyright (C) 2003-2020 Intel Corporation. All rights reserved.
[0] MPI startup(): library kind: release
[0] MPI startup(): libfabric version: 1.10.1-impi
[0] MPI startup(): libfabric provider: mlx
#------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2019 Update 7, MPI-1 part
#------------------------------------------------------------
# Date                  : Thu Apr 22 16:46:23 2021
# Machine               : x86_64
# System                : Linux
# Release               : 3.10.0-1160.11.1.el7.x86_64
# Version               : #1 SMP Mon Nov 30 13:05:31 EST 2020
# MPI Version           : 3.1
# MPI Thread Environment:
# Calling sequence was:
# /jobman/tmp1/fkf_transfer/impi-2019.9.304/intel64/bin/IMB-MPI1 -npmin 260 Sendrecv
# Minimum message length in bytes : 0
# Maximum message length in bytes : 4194304
# MPI_Datatype                    : MPI_BYTE
# MPI_Datatype for reductions     : MPI_FLOAT
# MPI_Op                          : MPI_SUM
#
"""
Steps to Reproduce:
1. Use a system with more than 256 logical CPUs.
2. Install ucx-1.5.2.
3. Run an Intel MPI 2019.9 job across more than 256 ranks, e.g. the IMB-MPI1 Sendrecv benchmark (see the sketch below).
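Consolidated, the reproduction from the case looks roughly like this (a sketch; the IMB-MPI1 path is the customer's install and np=260 is simply one value above 256):

# Verify the node exposes more than 256 logical CPUs
lscpu | grep '^CPU(s):'
# Run the Sendrecv benchmark across 260 ranks; with ucx-1.5.2 this aborts during MPI_Init_thread
np=260
mpirun -np $np /jobman/tmp1/fkf_transfer/impi-2019.9.304/intel64/bin/IMB-MPI1 -npmin $np Sendrecv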
Actual results:
MPI startup aborts in PMPI_Init_thread with "UCX ERROR no active messages transport" and "OFI get address vector map failed" (see the log above).
Questions:
1) Is ucx-1.7.0 going to be released officially for RHEL 7?
2) Alternatively, should the bug be reviewed and fixed in ucx-1.5.2?
3) How can I provide the strace output? It is 92 MiB, which is too large to upload to the BZ directly (one option is sketched below).
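For question 3, one possible approach (a sketch; strace.out is a placeholder file name and the 19M chunk size is simply chosen to stay under a typical attachment limit):

# Compress the 92 MiB strace, then split it into chunks small enough to attach individually
xz -9 strace.out                              # produces strace.out.xz
split -b 19M -d strace.out.xz strace.out.xz.part-
# Reassemble with: cat strace.out.xz.part-* > strace.out.xz && xz -d strace.out.xz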
Expected results:
The MPI job starts and runs to completion, as it does with the ucx-1.7.0 engineering build.
Additional info:
The customer provided a PowerPoint presentation with more information.
They also provided the strace output captured when the failure occurs, but it is 92 MiB.