AI Platform Core Components / AIPCC-7702

Required hwloc backport didn't make it into UBI9.6 base image

    • Bug
    • Resolution: Done
    • Critical
    • Accelerator Enablement
    • AIPCC Accelerators 21
    • Critical

      Description of problem:

      On ppc64le LPARs where `node0` is not present, hwloc topology discovery fails.

      Not all ppc64le LPARs have a node0 in the /sys tree.

      Original ticket for the backport: https://issues.redhat.com/browse/RHEL-118677

      Version numbers (base image, wheels, builder, etc):

          Have: hwloc 2.4.1-5
          Need: hwloc 2.4.1-6

      Steps to Reproduce:

          1. Run the container on ppc64le hardware with Spyre cards where `node0` is not present in the /sys tree.

      Actual results:

          podman exec -it 4e006576fdb6 bash
      ---- IBM AIU Device Discovery...
      Topology does not contain any NUMA node, aborting!
      ---- --> Detecting PCIe devices...
      ---- --> Detected:   0 AIU PFs,   0 AIU VFs,   0 NICs,   0 NVMEs
      aiu-discover-topo: /project_src/aiu-toolbox/aiu-discovery/aiu-discover-topo.cpp:392: int main(int, char**): Assertion `server_name' failed.
      Signal Received: 6 (Aborted)
      Signal Received from pid=189
      *** BACKTRACE ***
      /opt/ibm/spyre/senlib/lib/libsenlib-dd2.so(+0x6fe0d0)[0x7fff8495e0d0]
      linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x7fff86060464]
      /lib64/glibc-hwcaps/power10/libc.so.6(+0xa58a8)[0x7fff815658a8]
      /lib64/glibc-hwcaps/power10/libc.so.6(gsignal+0x20)[0x7fff81508360]
      /lib64/glibc-hwcaps/power10/libc.so.6(abort+0x134)[0x7fff814ea574]
      /lib64/glibc-hwcaps/power10/libc.so.6(+0x3dfd8)[0x7fff814fdfd8]
      /lib64/glibc-hwcaps/power10/libc.so.6(__assert_fail+0x58)[0x7fff814fe048]
      /opt/ibm/spyre/bin/aiu-discover-topo[0x10013194]
      /lib64/glibc-hwcaps/power10/libc.so.6(+0x2a944)[0x7fff814ea944]
      /lib64/glibc-hwcaps/power10/libc.so.6(__libc_start_main+0x13c)[0x7fff814eab0c]
      *****************
      ---- IBM AIU Environment Setup... (Generate environment from existing config)
      ---- IBM AIU Device Access: PF (config)
      Traceback (most recent call last):
        File "/opt/ibm/spyre/bin/aiu-assign-ranks.py", line 161, in <module>
          with open(pargs.topo) as fd:
               ^^^^^^^^^^^^^^^^
      FileNotFoundError: [Errno 2] No such file or directory: '/tmp/etc/ibm/spyre/topo.json'
      bash: /tmp/etc/ibm/spyre/aiu_env.sh: No such file or directory

      Important log from hwloc:

      Topology does not contain any NUMA node, aborting!  

      Node listing from the affected system (note that node0 is not present):

      [root@ltcever87-lp19 ~]# ls -al /sys/devices/system/node/
      total 0
      drwxr-xr-x.  6 root root     0 Nov 28 20:49 .
      drwxr-xr-x. 11 root root     0 Nov 28 20:49 ..
      -r--r--r--.  1 root root 65536 Dec  3 01:19 has_cpu
      -r--r--r--.  1 root root 65536 Dec  3 01:19 has_generic_initiator
      -r--r--r--.  1 root root 65536 Dec  3 01:19 has_memory
      -r--r--r--.  1 root root 65536 Dec  3 01:19 has_normal_memory
      drwxr-xr-x.  5 root root     0 Nov 28 20:49 node4
      drwxr-xr-x.  5 root root     0 Nov 28 20:49 node6
      drwxr-xr-x.  5 root root     0 Nov 28 20:49 node7
      -r--r--r--.  1 root root 65536 Dec  3 01:19 online
      -r--r--r--.  1 root root 65536 Dec  3 01:19 possible
      drwxr-xr-x.  2 root root     0 Dec  1 18:55 power
      -rw-r--r--.  1 root root 65536 Nov 28 20:49 uevent 
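
A host can be checked for this condition before starting the container. The following is a minimal sketch, not part of the original report: it assumes the NUMA nodes appear as `nodeN` directories under /sys/devices/system/node, as in the listing above, and the function names `list_numa_nodes` and `node0_missing` are hypothetical helpers introduced here for illustration.

```python
import os
import re

# Path taken from the listing above; passed as a parameter so it can be overridden.
NODE_DIR = "/sys/devices/system/node"


def list_numa_nodes(node_dir=NODE_DIR):
    """Return the sorted NUMA node ids that have a nodeN directory."""
    try:
        entries = os.listdir(node_dir)
    except FileNotFoundError:
        # No node directory at all (for example, a non-Linux host).
        return []
    ids = []
    for name in entries:
        match = re.fullmatch(r"node(\d+)", name)
        if match and os.path.isdir(os.path.join(node_dir, name)):
            ids.append(int(match.group(1)))
    return sorted(ids)


def node0_missing(node_dir=NODE_DIR):
    """True when NUMA nodes exist but node0 is not among them."""
    ids = list_numa_nodes(node_dir)
    return bool(ids) and 0 not in ids
```

On the system shown above, `list_numa_nodes()` would be expected to report nodes 4, 6 and 7, and `node0_missing()` would return `True`, which is the layout that triggers the failure with hwloc 2.4.1-5.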

      Expected results:

      Container starts up without error

      Additional info:
      This can be resolved either with the backport in ticket https://issues.redhat.com/browse/RHEL-118677, or by using hwloc >= 2.5.0.
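
The version rule above (2.4.1-6 or later for the backport, or any hwloc >= 2.5.0) can be written as a small check. This is a sketch under stated assumptions: the helper names `parse_version_release` and `has_node0_fix` are hypothetical, the input is assumed to be a `version-release` string such as `2.4.1-5`, and versions between 2.4.1 and 2.5.0 are treated as not fixed only because the ticket says nothing about them.

```python
import re


def parse_version_release(text):
    """Split a 'version-release' string such as '2.4.1-5' into ((2, 4, 1), 5)."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:-(\d+).*)?", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    version = tuple(int(part) for part in match.group(1).split("."))
    # Pad to three components so that '2.5' compares equal to '2.5.0'.
    version = version + (0,) * (3 - len(version))
    release = int(match.group(2)) if match.group(2) else 0
    return version, release


def has_node0_fix(text):
    """True if the version matches the rule stated in this ticket."""
    version, release = parse_version_release(text)
    if version >= (2, 5, 0):
        return True
    # Backport case: only 2.4.1 with release 6 or later is named in the ticket.
    return version == (2, 4, 1) and release >= 6
```

With the versions from this ticket, `has_node0_fix("2.4.1-5")` is `False` and `has_node0_fix("2.4.1-6")` is `True`.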

              spryor@redhat.com Sean Pryor
              lance_bart Lance barto
              Frank's Team