Bug
Resolution: Done
Critical
4.18.z
Quality / Stability / Reliability
False
3
Important
Yes
x86_64
QA
None
None
OCP Node Sprint 275 (green)
1
Done
Known Issue
None
None
None
None
Description of problem:
This issue impacts customers on OCP 4.18.22 who deploy the NVIDIA GPU Operator. When the clusterpolicy CR of the NVIDIA GPU Operator is deployed on OCP 4.18.22, the GPU stack fails to deploy: the nvidia-operator-validator pod is stuck in Init:CreateContainerError with "error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)".
This happens only on OCP 4.18.22; it does not occur on OCP 4.18.21, OCP 4.19.8, or OCP 4.20-ec5.
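On an affected cluster, the failing init container can be confirmed with standard pod and event queries; a minimal sketch (the `app=nvidia-operator-validator` label selector is an assumption):

```
# Sketch: confirm the failing init container and the hook error.
# The label selector below is an assumption; adjust it to match the validator pod.
oc get pods -n nvidia-gpu-operator
oc describe pod -n nvidia-gpu-operator -l app=nvidia-operator-validator
oc get events -n nvidia-gpu-operator --sort-by=.lastTimestamp | grep -i createcontainererror
```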
As a workaround, following the upstream GitHub issue (see Additional info), we added "no-cgroups = true" to /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml on the GPU worker node, after which the GPU stack deployed successfully on OCP 4.18.22:
```
[nvidia-container-cli]
environment = []
ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
load-kmods = true
path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
root = "/run/nvidia/driver"
no-cgroups = true
```
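A minimal sketch of applying the workaround to a single GPU worker node via `oc debug`; the node name/domain, the `sed` expression, and the pod label are assumptions, and the nvidia-container-toolkit daemonset may regenerate this file, so verify the setting persists:

```
# Sketch: add no-cgroups = true under [nvidia-container-cli] on one GPU worker node.
# The node name/domain, sed expression, and pod label are assumptions for illustration.
NODE=ip-10-0-33-221.ec2.internal

oc debug node/${NODE} -- chroot /host \
  sed -i '/^\[nvidia-container-cli\]/a no-cgroups = true' \
  /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml

# Recreate the stuck validator pod so the init container is retried.
oc delete pod -n nvidia-gpu-operator -l app=nvidia-operator-validator
```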
Further investigation showed that one difference between 4.18.21 and 4.18.22 is a crun bump from 1.21 to 1.23.
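The crun version on the node can be checked directly; a sketch, with the node name taken from the journal excerpt below (the node domain is an assumption):

```
# Sketch: verify the crun version shipped on the RHCOS worker node.
NODE=ip-10-0-33-221.ec2.internal
oc debug node/${NODE} -- chroot /host crun --version
oc debug node/${NODE} -- chroot /host rpm -q crun
```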
I was able to dig a bit more into this (ref: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/rh-ecosystem-edge_nvidia-ci/268/pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.18-stable-nvidia-gpu-operator-e2e-25-3-x/1954619565199593472/artifacts/nvidia-gpu-operator-e2e-25-3-x/).
Pulling the worker journal logs (see the comment at https://github.com/NVIDIA/gpu-operator/issues/1598#issuecomment-3201730167):
```
Aug 10 20:19:32.641438 ip-10-0-33-221 systemd[1]: Started crio-conmon-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope.
Aug 10 20:19:32.649487 ip-10-0-33-221 systemd[1]: crio-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope: bpf-foreign: Failed to create foreign BPF program: Permission denied
Aug 10 20:19:32.649496 ip-10-0-33-221 systemd[1]: crio-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope: bpf-foreign: Failed to prepare foreign BPF hashmap: Permission denied
Aug 10 20:19:32.657397 ip-10-0-33-221 systemd[1]: Started libcrun container.
Aug 10 20:19:32.711941 ip-10-0-33-221 systemd[1]: crio-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope: Deactivated successfully.
Aug 10 20:19:32.713855 ip-10-0-33-221 conmon[40135]: conmon 6ba7d8637fb000e431e1 <nwarn>: runtime stderr: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
Aug 10 20:19:32.713878 ip-10-0-33-221 conmon[40135]: conmon 6ba7d8637fb000e431e1 <error>: Failed to create container: exit status 255
Aug 10 20:19:32.714409 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.713964541Z" level=error msg="Container creation error: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)\n" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
Aug 10 20:19:32.714594 ip-10-0-33-221 systemd[1]: crio-conmon-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope: Deactivated successfully.
Aug 10 20:19:32.716368 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.716328833Z" level=info msg="createCtr: deleting container ID 6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8 from idIndex" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
Aug 10 20:19:32.716515 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.716374881Z" level=info msg="createCtr: removing container 6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
Aug 10 20:19:32.716515 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.716411999Z" level=info msg="createCtr: deleting container 6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8 from storage" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
Aug 10 20:19:32.730717 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.730676776Z" level=info msg="createCtr: releasing container name k8s_toolkit-validation_nvidia-operator-validator-kkpjf_nvidia-gpu-operator_3492f33c-e20f-47ea-9693-dc3304fefd84_2" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
Aug 10 20:19:32.730990 ip-10-0-33-221 kubenswrapper[2561]: E0810 20:19:32.730956 2561 log.go:32] "CreateContainer in sandbox from runtime service failed" err=<
Aug 10 20:19:32.730990 ip-10-0-33-221 kubenswrapper[2561]: rpc error: code = Unknown desc = container create failed: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
Aug 10 20:19:32.730990 ip-10-0-33-221 kubenswrapper[2561]: > podSandboxID="3f467232cfe54b8232419b143ba0f856120633bc841405d8847e64cff7d8b7f2"
Aug 10 20:19:32.731462 ip-10-0-33-221 kubenswrapper[2561]: E0810 20:19:32.731082 2561 kuberuntime_manager.go:1274] "Unhandled Error" err=<
Aug 10 20:19:32.731462 ip-10-0-33-221 kubenswrapper[2561]: init container &Container{Name:toolkit-validation,Image:nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:e183dc07e5889bd9e269c320ffad7f61df655f57ecc3aa158c4929e74528420a,Command:[sh -c],Args:[nvidia-validator],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:NVIDIA_VISIBLE_DEVICES,Value:all,ValueFrom:nil,},EnvVar{Name:WITH_WAIT,Value:false,ValueFrom:nil,},EnvVar{Name:COMPONENT,Value:toolkit,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:run-nvidia-validations,ReadOnly:false,MountPath:/run/nvidia/validations,SubPath:,MountPropagation:*Bidirectional,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:kube-api-access-rqmsz,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,AppArmorProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod nvidia-operator-validator-kkpjf_nvidia-gpu-operator(3492f33c-e20f-47ea-9693-dc3304fefd84): CreateContainerError: container create failed: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
Aug 10 20:19:32.731462 ip-10-0-33-221 kubenswrapper[2561]: > logger="UnhandledError"
```
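For reference, the equivalent journal entries can be pulled from the worker with `oc adm node-logs`; a sketch using the standard CRI-O and kubelet units (the node name/domain is an assumption):

```
# Sketch: pull CRI-O and kubelet journal logs from the affected worker node.
NODE=ip-10-0-33-221.ec2.internal
oc adm node-logs ${NODE} -u crio | grep -i nvidia-container-runtime-hook
oc adm node-logs ${NODE} -u kubelet | grep -i createcontainererror
```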
Version-Release number of selected component (if applicable):
4.18.22
How reproducible:
All the time
Steps to Reproduce:
1. Create an IPI SNO cluster on AWS.
2. Add a g4dn.xlarge GPU-enabled machine as a worker node.
3. Deploy the NFD Operator and its operand.
4. Deploy the NVIDIA GPU Operator (latest version, v25.3.2) from the certified-operators catalog, via the console or the CLI.
5. Deploy the clusterpolicy CR with default settings (see the sketch after these steps).
6. Check the pods in the nvidia-gpu-operator namespace: oc get pods -n nvidia-gpu-operator

The nvidia-operator-validator pod is stuck in Init:CreateContainerError - error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1). More details are in the GitHub issue linked under Additional info.
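A minimal sketch of step 5, creating the default ClusterPolicy from the installed CSV's alm-examples annotation; the CSV name pattern and the jq filter are assumptions for illustration:

```
# Sketch: create the default ClusterPolicy CR from the installed CSV (step 5).
# The CSV name pattern and jq filter are assumptions for illustration.
CSV=$(oc get csv -n nvidia-gpu-operator -o name | grep gpu-operator-certified)
oc get ${CSV} -n nvidia-gpu-operator -o jsonpath='{.metadata.annotations.alm-examples}' \
  | jq '.[0]' | oc apply -f -
```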
Actual results:
```
$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS                      RESTARTS        AGE
gpu-feature-discovery-q2fln                           0/1     Init:0/1                    0               14m
gpu-operator-5fcc456c94-wdq8x                         1/1     Running                     0               16m
nvidia-container-toolkit-daemonset-tc69k              1/1     Running                     0               14m
nvidia-dcgm-exporter-d2qrj                            0/1     Init:0/2                    0               14m
nvidia-dcgm-x6b6v                                     0/1     Init:0/1                    0               14m
nvidia-device-plugin-daemonset-92hr4                  0/1     Init:0/1                    0               14m
nvidia-driver-daemonset-418.94.202508060022-0-wnrz2   2/2     Running                     0               14m
nvidia-node-status-exporter-pr2zj                     1/1     Running                     0               14m
nvidia-operator-validator-wsrpg                       0/1     Init:CreateContainerError   2 (8m14s ago)   14m
```
Expected results:
```
$ oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-ssx5r                           1/1     Running     0          28h
gpu-operator-5fcc456c94-wdq8x                         1/1     Running     0          37h
nvidia-container-toolkit-daemonset-tc69k              1/1     Running     0          37h
nvidia-cuda-validator-kzfjn                           0/1     Completed   0          2m16s
nvidia-dcgm-exporter-d2qrj                            1/1     Running     0          37h
nvidia-dcgm-x6b6v                                     1/1     Running     0          37h
nvidia-device-plugin-daemonset-92hr4                  1/1     Running     0          37h
nvidia-driver-daemonset-418.94.202508060022-0-wnrz2   2/2     Running     0          37h
nvidia-node-status-exporter-pr2zj                     1/1     Running     0          37h
nvidia-operator-validator-v4j75                       1/1     Running     0          28h
```
Additional info:
https://github.com/NVIDIA/gpu-operator/issues/1598
is blocked by: RUN-3446 Impact OCP 4.18.22: NVIDIA GPU Operator v25.3.2 - nvidia-operator-validator pod in Init:CreateContainerError - error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1) - Closed
links to