Bug
Resolution: Unresolved
rhel-9.6
Low
rhel-arch-hw
Request: Make hwloc-libs >=2.5.0 available in RHEL 9.6
What were you trying to do that didn't work?
RHEL 9.6 is packaged with hwloc 2.4.1. We have found that this version does not support VMs that are not NUMA-aligned to numa0: it fails to produce a topology if there is no numa0 node in the /sys tree.
What is the impact of this issue to you?
We are unable to build the topology necessary to support IBM Spyre cards on some VMs running RHEL 9.6.
With hwloc 2.4.1:
[root@rhaiis-vm-1 ~]# podman run --device=/dev/vfio \
    -v ${HOST_MODELS_DIR}:/home/senuser/models \
    -e VLLM_MODEL_PATH=/home/senuser/models/granite-3.3-8b-instruct \
    -e VLLM_AIU_PCIE_IDS="${AIU_IDS}" \
    -e VLLM_SPYRE_USE_CB=1 \
    -e MAX_MODEL_LEN=3072 \
    -e MAX_BATCH_SIZE=16 \
    --pids-limit 0 \
    --memory 100G \
    --shm-size 64G \
    -p 127.0.0.1::8000 b3e39a80bdf4
---- IBM AIU Device Discovery...
Topology does not contain any NUMA node, aborting!
---- --> Detecting PCIe devices...
---- --> Detected: 0 AIU PFs, 0 AIU VFs, 0 NICs, 0 NVMEs
aiu-discover-topo: /project_src/aiu-toolbox/aiu-discovery/aiu-discover-topo.cpp:392: int main(int, char**): Assertion `server_name' failed.
Signal Received: 6 (Aborted)
Signal Received from pid=20
*** BACKTRACE ***
/opt/ibm/spyre/senlib/lib/libsenlib-dd2.so(+0x6fd7f0)[0x7fff9b91d7f0]
linux-vdso64.so.1(__kernel_sigtramp_rt64+0x0)[0x7fff9cee0464]
/lib64/glibc-hwcaps/power10/libc.so.6(+0xa58a8)[0x7fff985258a8]
/lib64/glibc-hwcaps/power10/libc.so.6(gsignal+0x20)[0x7fff984c8360]
/lib64/glibc-hwcaps/power10/libc.so.6(abort+0x134)[0x7fff984aa574]
/lib64/glibc-hwcaps/power10/libc.so.6(+0x3dfd8)[0x7fff984bdfd8]
/lib64/glibc-hwcaps/power10/libc.so.6(__assert_fail+0x58)[0x7fff984be048]
/opt/ibm/spyre/bin/aiu-discover-topo[0x10013194]
/lib64/glibc-hwcaps/power10/libc.so.6(+0x2a944)[0x7fff984aa944]
/lib64/glibc-hwcaps/power10/libc.so.6(__libc_start_main+0x13c)[0x7fff984aab0c]
*****************
With hwloc v2.11.1:
[root@pacpcm:~]: po logs 29daf23924d3
---- IBM AIU Device Discovery...
---- --> Detecting PCIe devices...
---- --> Detected: 4 AIU PFs, 0 AIU VFs, 0 NICs, 0 NVMEs
---- --> Discovering AIU Metadata: Skipped
---- --> Verifying AIU Protocols: Skipped
---- --> Writing final topology file: /tmp/etc/ibm/spyre/topo.json
---- IBM AIU Environment Setup... (Generate environment from existing config)
---- IBM AIU Device Access: PF (config)
---- IBM AIU Devices Found: 4
------------------------
---- IBM AIU Device Discovery... (Using cache)
---- IBM AIU Environment Setup... (Generate environment from existing config)
---- IBM AIU Device Access: PF (config)
---- IBM AIU Devices Found: 4
------------------------
/opt/vllm/lib64/python3.12/site-packages/torch/cuda/__init__.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
WARNING:torchao.kernel.intmm:Warning: Detected no triton, on systems without Triton certain kernels will not work
INFO 10-01 12:45:32 [__init__.py:36] Available plugins for group vllm.platform_plugins:
INFO 10-01 12:45:32 [__init__.py:38] - spyre -> vllm_spyre:register
INFO 10-01 12:45:32 [__init__.py:48] Loading plugin spyre
INFO 10-01 12:45:32 [__init__.py:232] Platform plugin spyre is activated
This is the commit that fixes the issue in hwloc v2.5.0:
commit 0114c2b0b3e39265e0829eebfff87ac9f4412fe9
Author: Brice Goglin <Brice.Goglin@inria.fr>
Date:   Mon Apr 26 20:35:42 2021 +0200

    linux: fix support for NUMA node0 being offline

    Just like we didn't support offline CPU#0 until commit
    7bcc273efd50536961ba16d474efca4ae163229b, we need to support
    node0 being offline as well.

    It's not clear whether it's a new Linux feature or not,
    this was reported on a POWER LPAR VM.

    The symptoms are different here because we got no NUMA nodes
    at all, hence the core hwloc added a default machine-wide node.
    But this node got marked disallowed by Linux cgroups.
    Hence load() failed with
    "Topology does not contain any NUMA node, aborting!"

    We opportunistically assume node0 is online to avoid the overhead
    in the vast majority of cases. If node0 is missing, we parse "online"
    to find the first node.

    Thanks to Jirka Hladky for the report.

    Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr>

 hwloc/topology-linux.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)
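The fallback described in the commit message ("If node0 is missing, we parse "online" to find the first node") can be illustrated with a minimal Python sketch. This is not the hwloc code, which is C in hwloc/topology-linux.c; the function name `first_online_node` is hypothetical, and the sketch assumes the "online" file uses the usual Linux list format, such as `1-3,8`.

```python
from typing import Optional


def first_online_node(online: str) -> Optional[int]:
    """Return the lowest node index named in a Linux list string.

    Assumes the kernel list format: comma-separated single indices
    or inclusive ranges, for example "1-3,8". Returns None when the
    string names no node at all.
    """
    starts = []
    for part in online.strip().split(","):
        part = part.strip()
        if not part:
            continue
        # For a range "a-b" only the lower bound matters here.
        starts.append(int(part.split("-", 1)[0]))
    return min(starts) if starts else None
```

On a VM like the one in this report, the string passed in would be the content of the node "online" file under /sys; that path is an assumption taken from the Linux sysfs layout, not from the commit.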
Please provide the package NVR for which the bug is seen:
How reproducible is this bug?:
On a Power VM (with Spyre cards) that has no numa0 node, running our spyre-vllm image fails, and the failure is easy to reproduce.
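To confirm that a VM matches the reproduction condition, one option is to list the node directories in sysfs. The sketch below is only an illustration: the helper names `numa_node_ids` and `lacks_node0` are hypothetical, and the default path /sys/devices/system/node is an assumption based on the standard Linux sysfs layout rather than something stated in this report.

```python
import os
import re


def numa_node_ids(node_dir="/sys/devices/system/node"):
    """Return the sorted indices N of all nodeN entries in node_dir."""
    try:
        entries = os.listdir(node_dir)
    except OSError:
        # Directory missing or unreadable: report no nodes.
        return []
    ids = []
    for name in entries:
        match = re.fullmatch(r"node(\d+)", name)
        if match:
            ids.append(int(match.group(1)))
    return sorted(ids)


def lacks_node0(node_dir="/sys/devices/system/node"):
    """True when NUMA nodes exist but node0 is not among them."""
    ids = numa_node_ids(node_dir)
    return bool(ids) and 0 not in ids


if __name__ == "__main__":
    print("NUMA nodes:", numa_node_ids())
    print("node0 missing:", lacks_node0())
```

If `lacks_node0()` returns True on the VM, it is in the configuration that this report says hwloc 2.4.1 does not handle.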
Steps to reproduce
Expected results
Actual results
blocks: AIPCC-3555 Tech Preview - Power/ppc64le: IBM Spyre AIU Accelerator support (In Progress)