Bug
Resolution: Done-Errata
Quality / Stability / Reliability
CLOSED
Important
Description of problem:
After configuring the NVIDIA GPU Operator on nodes with Ampere-based GPU cards, the following pods were not found:
nvidia-sandbox-device-plugin-daemonset-5rsv9
nvidia-sandbox-device-plugin-daemonset-q225z
nvidia-sandbox-validator-996wt
nvidia-sandbox-validator-shwj9
This is probably because the "lspci" command is not found in the container "openshift-driver-toolkit-ctr" of the pod "nvidia-vgpu-manager-daemonset-411.86.202208031059-0":
[kbidarka@localhost nvidia-gpu-operator]$ oc logs -c openshift-driver-toolkit-ctr -f nvidia-vgpu-manager-daemonset-411.86.202208031059-0-8wfxh | grep -A 5 "sriov-manage"
+ /usr/lib/nvidia/sriov-manage -e ALL
/usr/lib/nvidia/sriov-manage: line 259: lspci: command not found
+ return 0
Done, now waiting for signal
+ echo 'Done, now waiting for signal'
+ trap 'echo '\''Caught signal'\''; _shutdown; trap - EXIT; exit' HUP INT QUIT PIPE TERM
+ true
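To confirm the suspected cause directly, one could check whether lspci is on the PATH inside the container. The following is only a minimal sketch, assuming the namespace and container name shown in this report; check_lspci is a hypothetical helper name, and the pod is passed as an argument.

```shell
# Hypothetical helper: reports whether lspci is available in the
# openshift-driver-toolkit-ctr container of the given pod.
# Assumes the nvidia-gpu-operator namespace used elsewhere in this report.
check_lspci() {
    pod="$1"
    if oc -n nvidia-gpu-operator exec "$pod" -c openshift-driver-toolkit-ctr -- \
        /bin/sh -c 'command -v lspci' >/dev/null 2>&1; then
        echo "$pod: lspci found"
    else
        echo "$pod: lspci missing"
    fi
}
```

Example usage: check_lspci pod/nvidia-vgpu-manager-daemonset-411.86.202208031059-0-8wfxh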
Version-Release number of selected component (if applicable):
How reproducible:
Reproducible when installing the NVIDIA GPU Operator on nodes with Ampere-architecture GPUs.
Steps to Reproduce:
1.
2.
3.
Actual results:
+ /usr/lib/nvidia/sriov-manage -e ALL
/usr/lib/nvidia/sriov-manage: line 259: lspci: command not found
Expected results:
+ /usr/lib/nvidia/sriov-manage -e ALL
The above command should complete without the "lspci: command not found" error.
Additional info:
Workaround:
1) Install the pciutils package in the "openshift-driver-toolkit-ctr" container of each vGPU manager pod and re-run sriov-manage:
oc -n nvidia-gpu-operator exec pod/nvidia-vgpu-manager-daemonset-411.86.202208031059-0-8wfxh -it -c openshift-driver-toolkit-ctr -- /bin/sh -euxc 'dnf install -y pciutils; /usr/lib/nvidia/sriov-manage -e ALL'
oc -n nvidia-gpu-operator exec pod/nvidia-vgpu-manager-daemonset-411.86.202208031059-0-vk7pw -it -c openshift-driver-toolkit-ctr -- /bin/sh -euxc 'dnf install -y pciutils; /usr/lib/nvidia/sriov-manage -e ALL'
2) Then label each node with "nvidia.com/vgpu.config=<MDEV-TYPE>":
oc label node node32.redhat.com --overwrite nvidia.com/vgpu.config=A2-2Q
oc label node node33.redhat.com --overwrite nvidia.com/vgpu.config=A2-2Q
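On clusters with more than a couple of GPU nodes, the first workaround step could be looped over every vGPU manager pod rather than typed per pod. This is only a sketch: apply_workaround is a hypothetical helper name, and it assumes the pods can be identified by the "nvidia-vgpu-manager-daemonset" substring in their names, as in this report.

```shell
# Hypothetical helper: installs pciutils and re-runs sriov-manage in every
# vGPU manager pod. Assumes pod names contain "nvidia-vgpu-manager-daemonset"
# and that the pods live in the nvidia-gpu-operator namespace.
apply_workaround() {
    oc -n nvidia-gpu-operator get pods -o name |
        grep nvidia-vgpu-manager-daemonset |
        while read -r pod; do
            echo "Applying workaround in $pod"
            # stdin is redirected so the exec cannot consume the pod list
            oc -n nvidia-gpu-operator exec "$pod" -c openshift-driver-toolkit-ctr -- \
                /bin/sh -euxc 'dnf install -y pciutils; /usr/lib/nvidia/sriov-manage -e ALL' </dev/null
        done
}
```

Because the package is installed into the running container, the step would need to be repeated if a pod is recreated.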