Bug
Resolution: Unresolved
Normal
None
rhelai-1.5
True
False
To Reproduce
Steps to reproduce the behavior:
While serving a model on RHEL AI 1.5 on a bare-metal server with Gaudi 3 AI accelerators,
run the following command as a non-root user on the host: hl-smi
[redhat@g3-srv15-c03b-idc ~]$ hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.20.1-fw-58.2.7.0 |
| Driver Version: 1.20.1-366eb9c |
| Nic Driver Version: 1.20.1-213b09b |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncor-Events|
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-325 N/A | 0000:9a:00.0 N/A | 0 |
| N/A 38C P0 261W / 900W |131072MiB / 131072MiB | 0% 100% |
|-------------------------------+----------------------+----------------------+
| 1 HL-325 N/A | 0000:21:00.0 N/A | 0 |
| N/A 38C P0 264W / 900W | 672MiB / 131072MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 2 HL-325 N/A | 0000:9b:00.0 N/A | 0 |
| N/A 34C P0 263W / 900W |131072MiB / 131072MiB | 0% 100% |
|-------------------------------+----------------------+----------------------+
| 3 HL-325 N/A | 0000:34:00.0 N/A | 0 |
| N/A 31C P0 260W / 900W | 672MiB / 131072MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 4 HL-325 N/A | 0000:22:00.0 N/A | 0 |
| N/A 31C P0 265W / 900W | 672MiB / 131072MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 5 HL-325 N/A | 0000:ae:00.0 N/A | 0 |
| N/A 32C P0 266W / 900W |131072MiB / 131072MiB | 0% 100% |
|-------------------------------+----------------------+----------------------+
| 6 HL-325 N/A | 0000:35:00.0 N/A | 0 |
| N/A 36C P0 271W / 900W |131072MiB / 131072MiB | 0% 100% |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 498056 C python3.11 130400MiB
| 1 N/A N/A N/A N/A |
| 2 498188 C python3.11 130400MiB
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 498189 C python3.11 130400MiB
| 6 498187 C python3.11 130400MiB
+=============================================================================+
[redhat@g3-srv15-c03b-idc ~]$
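On the host, memory usage is reported as 131072MiB / 131072MiB (100%) for the four devices in use (AIP 0, 2, 5, and 6).
Running the same command inside the running container shows the correct usage (about 107181MiB / 131072MiB, 81%):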
[redhat@g3-srv15-c03b-idc ~]$ podman exec -i -t 4795caa3dcf1 /bin/sh
(app-root) /$ hl-smi
+-----------------------------------------------------------------------------+
| HL-SMI Version: hl-1.20.1-fw-58.2.7.0 |
| Driver Version: 1.20.1-366eb9c |
| Nic Driver Version: 1.20.1-213b09b |
|-------------------------------+----------------------+----------------------+
| AIP Name Persistence-M| Bus-Id Disp.A | Volatile Uncor-Events|
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | AIP-Util Compute M. |
|===============================+======================+======================|
| 0 HL-325 N/A | 0000:9a:00.0 N/A | 0 |
| N/A 38C P0 262W / 900W |107181MiB / 131072MiB | 0% 81% |
|-------------------------------+----------------------+----------------------+
| 1 HL-325 N/A | 0000:21:00.0 N/A | 0 |
| N/A 38C P0 264W / 900W | 672MiB / 131072MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 2 HL-325 N/A | 0000:9b:00.0 N/A | 0 |
| N/A 34C P0 264W / 900W |107181MiB / 131072MiB | 0% 81% |
|-------------------------------+----------------------+----------------------+
| 3 HL-325 N/A | 0000:34:00.0 N/A | 0 |
| N/A 32C P0 260W / 900W | 672MiB / 131072MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 4 HL-325 N/A | 0000:22:00.0 N/A | 0 |
| N/A 32C P0 265W / 900W | 672MiB / 131072MiB | 0% 0% |
|-------------------------------+----------------------+----------------------+
| 5 HL-325 N/A | 0000:ae:00.0 N/A | 0 |
| N/A 32C P0 266W / 900W |107182MiB / 131072MiB | 0% 81% |
|-------------------------------+----------------------+----------------------+
| 6 HL-325 N/A | 0000:35:00.0 N/A | 0 |
| N/A 36C P0 271W / 900W |107182MiB / 131072MiB | 0% 81% |
|-------------------------------+----------------------+----------------------+
| Compute Processes: AIP Memory |
| AIP PID Type Process name Usage |
|=============================================================================|
| 0 24 C python3.11 106509MiB
| 1 N/A N/A N/A N/A |
| 2 156 C python3.11 106509MiB
| 3 N/A N/A N/A N/A |
| 4 N/A N/A N/A N/A |
| 5 157 C python3.11 106510MiB
| 6 155 C python3.11 106510MiB
+=============================================================================+
(app-root) /$
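To compare the two outputs without reading the tables by eye, the per-device memory figures can be pulled out with a small filter. The following is a minimal sketch, not part of the original report: the function name hl_mem_usage is hypothetical, and it assumes hl-smi keeps the table layout shown above (a device row whose third field starts with "HL-", followed by a row containing "<used>MiB / <total>MiB").

```shell
#!/bin/sh
# Hypothetical helper: reads hl-smi table output on stdin and prints one line
# per device in the form "<aip> <used MiB> <total MiB> <percent>".
# Assumes the layout shown in this report; other hl-smi versions may differ.
hl_mem_usage() {
    awk '
    # Device header row, e.g. "| 0 HL-325 N/A | 0000:9a:00.0 N/A | 0 |"
    $3 ~ /^HL-/ { aip = $2 }
    # Memory row, e.g. "... |107181MiB / 131072MiB | 0% 81% |"
    match($0, /[0-9]+MiB *\/ *[0-9]+MiB/) {
        s = substr($0, RSTART, RLENGTH)
        gsub(/MiB/, "", s)
        split(s, m, /[ \/]+/)
        # %d truncates, so 81.77 is printed as 81
        printf "%s %d %d %d\n", aip, m[1], m[2], m[1] * 100 / m[2]
    }'
}

# Usage (commented out so that sourcing this file runs nothing):
#   hl-smi | hl_mem_usage                            # on the host
#   podman exec 4795caa3dcf1 hl-smi | hl_mem_usage   # inside the container
```

Running both commands and diffing the results would show the discrepancy directly: the host reports 131072 used on AIP 0, 2, 5, and 6, while the container reports roughly 107181.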
Expected behavior
hl-smi on the host should report the actual memory usage, matching what hl-smi reports inside the running container.
Screenshots
Device Info (please complete the following information):
- Hardware Specs: Gaudi 3
- OS Version: RHEL AI 1.5
- InstructLab Version: ilab, version 0.26.1
- Provide the output of these two commands:
- sudo bootc status --format json | jq .status.booted.image.image.image
  "registry.stage.redhat.io/rhelai1/bootc-intel-rhel9:1.5-1747201255"
- ilab system info (see attached file: g3-srv15-c03b-idc-ilab-system-info)
Bug impact
hl-smi on the host reports incorrect memory usage (100%) for the devices used by the serving container.
Known workaround
Add the following argument to the podman run command:
--ipc=host
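As an illustration of where the flag goes, here is a minimal sketch of a wrapper that always passes --ipc=host to podman run, so the container shares the host's IPC namespace. The report does not show how the serving container is launched, so the function name podman_run_host_ipc, the PODMAN override variable, and the image name in the example are all assumptions made for illustration.

```shell
#!/bin/sh
# PODMAN can be overridden (for example PODMAN=echo) to print the arguments
# instead of starting a container. This variable is an assumption of the sketch.
PODMAN="${PODMAN:-podman}"

# Hypothetical wrapper: runs a container with the host IPC namespace
# (the workaround from this report). All arguments are passed through
# to "podman run" unchanged.
podman_run_host_ipc() {
    "$PODMAN" run --ipc=host "$@"
}

# Example (image name is a placeholder, not taken from the report):
#   podman_run_host_ipc --rm -it example.registry/my-image:latest hl-smi
```

With PODMAN=echo the wrapper only prints the argument list, which is a cheap way to confirm the flag is placed before the image name.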
Additional context
- NA