Bug
Resolution: Done
Critical
4.18.z
Quality / Stability / Reliability
False
3
Important
Yes
x86_64
QA
None
None
OCP Node Sprint 275 (green)
1
Done
Known Issue
None
None
None
None
Description of problem:
This issue impacts customers on OCP 4.18.22 who deploy the NVIDIA GPU Operator. When the clusterpolicy CR of the NVIDIA GPU Operator is deployed on OCP 4.18.22, the GPU stack fails to deploy: the nvidia-operator-validator pod is stuck in Init:CreateContainerError with "error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)".
This happens only on OCP 4.18.22; it does not occur on OCP 4.18.21, OCP 4.19.8, or OCP 4.20-ec5.
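On an affected cluster, the failing init container can be confirmed with standard pod and event queries; a minimal sketch (the `app=nvidia-operator-validator` label selector is an assumption):

```
# Sketch: confirm the failing init container and the hook error.
# The label selector below is an assumption; adjust it to match the validator pod.
oc get pods -n nvidia-gpu-operator
oc describe pod -n nvidia-gpu-operator -l app=nvidia-operator-validator
oc get events -n nvidia-gpu-operator --sort-by=.lastTimestamp | grep -i createcontainererror
```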
As a workaround, following the upstream GitHub issue (see Additional info), we added "no-cgroups = true" to /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml on the GPU worker node, after which the GPU stack deployed successfully on OCP 4.18.22:
```
[nvidia-container-cli]
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
no-cgroups = true
```
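A minimal sketch of applying the workaround to a single GPU worker node via `oc debug`; the node name/domain, the `sed` expression, and the pod label are assumptions, and the nvidia-container-toolkit daemonset may regenerate this file, so verify the setting persists:

```
# Sketch: add no-cgroups = true under [nvidia-container-cli] on one GPU worker node.
# The node name/domain, sed expression, and pod label are assumptions for illustration.
NODE=ip-10-0-33-221.ec2.internal

oc debug node/${NODE} -- chroot /host \
  sed -i '/^\[nvidia-container-cli\]/a no-cgroups = true' \
  /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml

# Recreate the stuck validator pod so the init container is retried.
oc delete pod -n nvidia-gpu-operator -l app=nvidia-operator-validator
```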
Further investigation showed that one difference between 4.18.21 and 4.18.22 is a crun bump from 1.21 to 1.23.
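The crun version on the node can be checked directly; a sketch, with the node name taken from the journal excerpt below (the node domain is an assumption):

```
# Sketch: verify the crun version shipped on the RHCOS worker node.
NODE=ip-10-0-33-221.ec2.internal
oc debug node/${NODE} -- chroot /host crun --version
oc debug node/${NODE} -- chroot /host rpm -q crun
```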
I was able to dig a bit more into this (ref: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/rh-ecosystem-edge_nvidia-ci/268/pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.18-stable-nvidia-gpu-operator-e2e-25-3-x/1954619565199593472/artifacts/nvidia-gpu-operator-e2e-25-3-x/).
Pulling the worker journal logs (see the comment at https://github.com/NVIDIA/gpu-operator/issues/1598#issuecomment-3201730167):
```
Aug 10 20:19:32.641438 ip-10-0-33-221 systemd[1]: Started crio-conmon-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope.
Aug 10 20:19:32.649487 ip-10-0-33-221 systemd[1]: crio-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope: bpf-foreign: Failed to create foreign BPF program: Permission denied
Aug 10 20:19:32.649496 ip-10-0-33-221 systemd[1]: crio-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope: bpf-foreign: Failed to prepare foreign BPF hashmap: Permission denied
Aug 10 20:19:32.657397 ip-10-0-33-221 systemd[1]: Started libcrun container.
Aug 10 20:19:32.711941 ip-10-0-33-221 systemd[1]: crio-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope: Deactivated successfully.
Aug 10 20:19:32.713855 ip-10-0-33-221 conmon[40135]: conmon 6ba7d8637fb000e431e1 <nwarn>: runtime stderr: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
Aug 10 20:19:32.713878 ip-10-0-33-221 conmon[40135]: conmon 6ba7d8637fb000e431e1 <error>: Failed to create container: exit status 255
Aug 10 20:19:32.714409 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.713964541Z" level=error msg="Container creation error: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)\n" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
Aug 10 20:19:32.714594 ip-10-0-33-221 systemd[1]: crio-conmon-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope: Deactivated successfully.
Aug 10 20:19:32.716368 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.716328833Z" level=info msg="createCtr: deleting container ID 6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8 from idIndex" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
Aug 10 20:19:32.716515 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.716374881Z" level=info msg="createCtr: removing container 6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
Aug 10 20:19:32.716515 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.716411999Z" level=info msg="createCtr: deleting container 6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8 from storage" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
Aug 10 20:19:32.730717 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.730676776Z" level=info msg="createCtr: releasing container name k8s_toolkit-validation_nvidia-operator-validator-kkpjf_nvidia-gpu-operator_3492f33c-e20f-47ea-9693-dc3304fefd84_2" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
Aug 10 20:19:32.730990 ip-10-0-33-221 kubenswrapper[2561]: E0810 20:19:32.730956 2561 log.go:32] "CreateContainer in sandbox from runtime service failed" err=<
Aug 10 20:19:32.730990 ip-10-0-33-221 kubenswrapper[2561]: rpc error: code = Unknown desc = container create failed: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
Aug 10 20:19:32.730990 ip-10-0-33-221 kubenswrapper[2561]: > podSandboxID="3f467232cfe54b8232419b143ba0f856120633bc841405d8847e64cff7d8b7f2"
Aug 10 20:19:32.731462 ip-10-0-33-221 kubenswrapper[2561]: E0810 20:19:32.731082 2561 kuberuntime_manager.go:1274] "Unhandled Error" err=<
Aug 10 20:19:32.731462 ip-10-0-33-221 kubenswrapper[2561]: init container &Container{Name:toolkit-validation,Image:nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:e183dc07e5889bd9e269c320ffad7f61df655f57ecc3aa158c4929e74528420a,Command:[sh -c],Args:[nvidia-validator],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:NVIDIA_VISIBLE_DEVICES,Value:all,ValueFrom:nil,},EnvVar{Name:WITH_WAIT,Value:false,ValueFrom:nil,},EnvVar{Name:COMPONENT,Value:toolkit,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:run-nvidia-validations,ReadOnly:false,MountPath:/run/nvidia/validations,SubPath:,MountPropagation:*Bidirectional,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:kube-api-access-rqmsz,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,AppArmorProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod nvidia-operator-validator-kkpjf_nvidia-gpu-operator(3492f33c-e20f-47ea-9693-dc3304fefd84): CreateContainerError: container create failed: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
Aug 10 20:19:32.731462 ip-10-0-33-221 kubenswrapper[2561]: > logger="UnhandledError"
```
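For reference, the equivalent journal entries can be pulled from the worker with `oc adm node-logs`; a sketch using the standard CRI-O and kubelet units (the node name/domain is an assumption):

```
# Sketch: pull CRI-O and kubelet journal logs from the affected worker node.
NODE=ip-10-0-33-221.ec2.internal
oc adm node-logs ${NODE} -u crio | grep -i nvidia-container-runtime-hook
oc adm node-logs ${NODE} -u kubelet | grep -i createcontainererror
```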
Version-Release number of selected component (if applicable):
4.18.22
How reproducible:
All the time
Steps to Reproduce:
1. Create an IPI SNO cluster on AWS.
2. Add a g4dn.xlarge GPU-enabled machine as a worker node.
3. Deploy the NFD Operator and its operand.
4. Deploy the NVIDIA GPU Operator (latest version, v25.3.2) from the certified-operators catalog, via the console or the CLI.
5. Deploy the clusterpolicy CR with default settings (see the sketch after these steps).
6. Check the pods in the nvidia-gpu-operator namespace: oc get pods -n nvidia-gpu-operator

The nvidia-operator-validator pod is stuck in Init:CreateContainerError - error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1). More details are in the GitHub issue linked under Additional info.
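A minimal sketch of step 5, creating the default ClusterPolicy from the installed CSV's alm-examples annotation; the CSV name pattern and the jq filter are assumptions for illustration:

```
# Sketch: create the default ClusterPolicy CR from the installed CSV (step 5).
# The CSV name pattern and jq filter are assumptions for illustration.
CSV=$(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified)
oc get ${CSV} -n nvidia-gpu-operator -o jsonpath='{.metadata.annotations.alm-examples}' \
  | jq '.[0]' | oc apply -f -
```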
Actual results:
```
$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS                      RESTARTS        AGE
gpu-feature-discovery-q2fln                           0/1     Init:0/1                    0               14m
gpu-operator-5fcc456c94-wdq8x                         1/1     Running                     0               16m
nvidia-container-toolkit-daemonset-tc69k              1/1     Running                     0               14m
nvidia-dcgm-exporter-d2qrj                            0/1     Init:0/2                    0               14m
nvidia-dcgm-x6b6v                                     0/1     Init:0/1                    0               14m
nvidia-device-plugin-daemonset-92hr4                  0/1     Init:0/1                    0               14m
nvidia-driver-daemonset-418.94.202508060022-0-wnrz2   2/2     Running                     0               14m
nvidia-node-status-exporter-pr2zj                     1/1     Running                     0               14m
nvidia-operator-validator-wsrpg                       0/1     Init:CreateContainerError   2 (8m14s ago)   14m
```
Expected results:
```
$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-ssx5r                           1/1     Running     0          28h
gpu-operator-5fcc456c94-wdq8x                         1/1     Running     0          37h
nvidia-container-toolkit-daemonset-tc69k              1/1     Running     0          37h
nvidia-cuda-validator-kzfjn                           0/1     Completed   0          2m16s
nvidia-dcgm-exporter-d2qrj                            1/1     Running     0          37h
nvidia-dcgm-x6b6v                                     1/1     Running     0          37h
nvidia-device-plugin-daemonset-92hr4                  1/1     Running     0          37h
nvidia-driver-daemonset-418.94.202508060022-0-wnrz2   2/2     Running     0          37h
nvidia-node-status-exporter-pr2zj                     1/1     Running     0          37h
nvidia-operator-validator-v4j75                       1/1     Running     0          28h
```
Additional info:
https://github.com/NVIDIA/gpu-operator/issues/1598
is blocked by: RUN-3446 Impact OCP 4.18.22: NVIDIA GPU Operator v25.3.2 - nvidia-operator-validator pod in Init:CreateContainerError - error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1) - Closed
links to