Bug
Resolution: Done
Critical
Description of problem:
On ppc64le LPARs where `node0` is absent from the /sys tree, hwloc topology discovery fails. Not all ppc64le LPARs expose a `node0`. Original ticket for the backport: https://issues.redhat.com/browse/RHEL-118677
Version numbers (base image, wheels, builder, etc):
Have: hwloc 2.4.1-5
Need: hwloc 2.4.1-6
Steps to Reproduce:
1. Run the container on ppc64le hardware with Spyre cards where `node0` is not present in the /sys tree.
Actual results:
podman exec -it 4e006576fdb6 bash
---- IBM AIU Device Discovery...
Topology does not contain any NUMA node, aborting!
---- --> Detecting PCIe devices...
---- --> Detected: 0 AIU PFs, 0 AIU VFs, 0 NICs, 0 NVMEs
aiu-discover-topo: /project_src/aiu-toolbox/aiu-discovery/aiu-discover-topo.cpp:392: int main(int, char**): Assertion `server_name' failed.
Signal Received: 6 (Aborted)
Signal Received from pid=189
*** BACKTRACE ***
/opt/ibm/spyre/senlib/lib/libsenlib-dd2.so(+0x6fe0d0)[0x7fff8495e0d0]
linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x7fff86060464]
/lib64/glibc-hwcaps/power10/libc.so.6(+0xa58a8)[0x7fff815658a8]
/lib64/glibc-hwcaps/power10/libc.so.6(gsignal+0x20)[0x7fff81508360]
/lib64/glibc-hwcaps/power10/libc.so.6(abort+0x134)[0x7fff814ea574]
/lib64/glibc-hwcaps/power10/libc.so.6(+0x3dfd8)[0x7fff814fdfd8]
/lib64/glibc-hwcaps/power10/libc.so.6(__assert_fail+0x58)[0x7fff814fe048]
/opt/ibm/spyre/bin/aiu-discover-topo[0x10013194]
/lib64/glibc-hwcaps/power10/libc.so.6(+0x2a944)[0x7fff814ea944]
/lib64/glibc-hwcaps/power10/libc.so.6(__libc_start_main+0x13c)[0x7fff814eab0c]
*****************
---- IBM AIU Environment Setup... (Generate environment from existing config)
---- IBM AIU Device Access: PF (config)
Traceback (most recent call last):
  File "/opt/ibm/spyre/bin/aiu-assign-ranks.py", line 161, in <module>
    with open(pargs.topo) as fd:
         ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/tmp/etc/ibm/spyre/topo.json'
bash: /tmp/etc/ibm/spyre/aiu_env.sh: No such file or directory
Important log from hwloc:
Topology does not contain any NUMA node, aborting!
From this system (notice node0 is not present):
[root@ltcever87-lp19 ~]# ls -al /sys/devices/system/node/
total 0
drwxr-xr-x.  6 root root     0 Nov 28 20:49 .
drwxr-xr-x. 11 root root     0 Nov 28 20:49 ..
-r--r--r--.  1 root root 65536 Dec  3 01:19 has_cpu
-r--r--r--.  1 root root 65536 Dec  3 01:19 has_generic_initiator
-r--r--r--.  1 root root 65536 Dec  3 01:19 has_memory
-r--r--r--.  1 root root 65536 Dec  3 01:19 has_normal_memory
drwxr-xr-x.  5 root root     0 Nov 28 20:49 node4
drwxr-xr-x.  5 root root     0 Nov 28 20:49 node6
drwxr-xr-x.  5 root root     0 Nov 28 20:49 node7
-r--r--r--.  1 root root 65536 Dec  3 01:19 online
-r--r--r--.  1 root root 65536 Dec  3 01:19 possible
drwxr-xr-x.  2 root root     0 Dec  1 18:55 power
-rw-r--r--.  1 root root 65536 Nov 28 20:49 uevent
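To check quickly whether a host has the layout that triggers this failure, the listing above can be reproduced programmatically. The following is a minimal diagnostic sketch, not hwloc's own discovery logic; the function names `numa_node_ids` and `is_affected_layout` are hypothetical and were written for this ticket only.

```python
import os
import re


def numa_node_ids(node_dir="/sys/devices/system/node"):
    """Return the sorted NUMA node IDs found as nodeN directories.

    Hypothetical helper: returns an empty list if the directory is missing.
    """
    try:
        names = os.listdir(node_dir)
    except FileNotFoundError:
        return []
    ids = []
    for name in names:
        match = re.fullmatch(r"node(\d+)", name)
        if match and os.path.isdir(os.path.join(node_dir, name)):
            ids.append(int(match.group(1)))
    return sorted(ids)


def is_affected_layout(ids):
    """True when NUMA nodes exist but node0 is not among them."""
    return bool(ids) and 0 not in ids


if __name__ == "__main__":
    ids = numa_node_ids()
    print("NUMA node IDs:", ids)
    print("node0 missing while other nodes exist:", is_affected_layout(ids))
```

On the system shown above, this would be expected to report nodes 4, 6 and 7 and flag the layout as affected.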
Expected results:
Container starts up without error
Additional info:
This can be resolved either with the backport in ticket https://issues.redhat.com/browse/RHEL-118677, or by using hwloc >= 2.5.0.
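The version condition in this ticket (hwloc 2.4.1-6 with the backport, or hwloc >= 2.5.0) could be checked with something like the sketch below. It is a simplified comparison, not a full RPM version comparison; the helper names are hypothetical, and it assumes version strings of the form `X.Y.Z-N`, optionally with a suffix after the release number. Builds other than the two named in this ticket are treated as not fixed, which may be too conservative.

```python
import re


def parse_version_release(text):
    """Split 'X.Y.Z-N[.suffix]' into a version tuple and an integer release."""
    version, _, release = text.partition("-")
    ver = tuple(int(part) for part in version.split("."))
    ver = ver + (0,) * (3 - len(ver))  # pad short versions such as '2.5'
    match = re.match(r"\d+", release)
    rel = int(match.group()) if match else 0
    return ver, rel


def has_node0_fix(text):
    """True for hwloc >= 2.5.0, or for 2.4.1 with release >= 6 (the backport)."""
    ver, rel = parse_version_release(text)
    if ver >= (2, 5, 0):
        return True
    return ver == (2, 4, 1) and rel >= 6


if __name__ == "__main__":
    for candidate in ("2.4.1-5", "2.4.1-6", "2.5.0"):
        print(candidate, has_node0_fix(candidate))
```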
- is blocked by: AIPCC-7780 ppc64le runners are down (Closed)
- is related to: RHEL-118677 RHEL9.6 hwloc support upgrade (Closed)