OpenShift Bugs / OCPBUGS-60663

OCP 4.18.22: NVIDIA GPU Operator v25.3.2 - nvidia-operator-validator pod in Init:CreateContainerError - error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)


      Description of problem:

      This issue impacts our current customers on OCP 4.18.22 who deploy the NVIDIA GPU Operator. When deploying the clusterpolicy CR of the NVIDIA GPU Operator on OCP 4.18.22, the GPU stack fails to deploy: the nvidia-operator-validator pod is stuck in Init:CreateContainerError with the error: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)

      This only happens on OCP 4.18.22. The issue does not occur on OCP 4.18.21, OCP 4.19.8, or OCP 4.20-ec5.

       

      As a workaround, we followed this issue:

      containers/podman#16101

      We added "no-cgroups = true" to the GPU worker node's /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml file, and the GPU stack now deploys successfully on OCP 4.18.22. The resulting [nvidia-container-cli] section:

      [nvidia-container-cli]
        environment = []
        ldconfig = "@/run/nvidia/driver/sbin/ldconfig"
        load-kmods = true
        path = "/usr/local/nvidia/toolkit/nvidia-container-cli"
        root = "/run/nvidia/driver"
        no-cgroups = true
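
      For illustration, a minimal sketch of applying the workaround on the node; the node name is a placeholder and the validator pod label selector is an assumption (deleting the pod by name works too):

      ```
      $ oc debug node/<gpu-worker-node>
      sh-5.1# chroot /host
      # add "no-cgroups = true" under the [nvidia-container-cli] section
      sh-5.1# vi /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml
      sh-5.1# exit
      # recreate the failing validator pod so it picks up the new toolkit config
      # (label selector assumed; deleting the pod by name works too)
      $ oc delete pod -n nvidia-gpu-operator -l app=nvidia-operator-validator
      ```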
      

      Further investigations showed:
       
      One difference from 4.18.21 to 4.18.22 was a crun bump from 1.21 to 1.23.
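
      A quick way to confirm which crun build a given worker is running (node name is a placeholder):

      ```
      # query the crun package on the RHCOS host
      $ oc debug node/<gpu-worker-node> -- chroot /host rpm -q crun
      ```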

      I was able to dig a bit more into this (ref: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/rh-ecosystem-edge_nvidia-ci/268/pull-ci-rh-ecosystem-edge-nvidia-ci-main-4.18-stable-nvidia-gpu-operator-e2e-25-3-x/1954619565199593472/artifacts/nvidia-gpu-operator-e2e-25-3-x/).

      Pulling the worker journal logs (see the comment at https://github.com/NVIDIA/gpu-operator/issues/1598#issuecomment-3201730167) shows:

      Aug 10 20:19:32.641438 ip-10-0-33-221 systemd[1]: Started crio-conmon-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope.
      Aug 10 20:19:32.649487 ip-10-0-33-221 systemd[1]: crio-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope: bpf-foreign: Failed to create foreign BPF program: Permission denied
      Aug 10 20:19:32.649496 ip-10-0-33-221 systemd[1]: crio-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope: bpf-foreign: Failed to prepare foreign BPF hashmap: Permission denied
      Aug 10 20:19:32.657397 ip-10-0-33-221 systemd[1]: Started libcrun container.
      Aug 10 20:19:32.711941 ip-10-0-33-221 systemd[1]: crio-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope: Deactivated successfully.
      Aug 10 20:19:32.713855 ip-10-0-33-221 conmon[40135]: conmon 6ba7d8637fb000e431e1 <nwarn>: runtime stderr: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
      Aug 10 20:19:32.713878 ip-10-0-33-221 conmon[40135]: conmon 6ba7d8637fb000e431e1 <error>: Failed to create container: exit status 255
      Aug 10 20:19:32.714409 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.713964541Z" level=error msg="Container creation error: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)\n" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
      Aug 10 20:19:32.714594 ip-10-0-33-221 systemd[1]: crio-conmon-6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8.scope: Deactivated successfully.
      Aug 10 20:19:32.716368 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.716328833Z" level=info msg="createCtr: deleting container ID 6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8 from idIndex" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
      Aug 10 20:19:32.716515 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.716374881Z" level=info msg="createCtr: removing container 6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
      Aug 10 20:19:32.716515 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.716411999Z" level=info msg="createCtr: deleting container 6ba7d8637fb000e431e1a922daad7228cc45474709761a0e6d721db9a38a44e8 from storage" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
      Aug 10 20:19:32.730717 ip-10-0-33-221 crio[2502]: time="2025-08-10 20:19:32.730676776Z" level=info msg="createCtr: releasing container name k8s_toolkit-validation_nvidia-operator-validator-kkpjf_nvidia-gpu-operator_3492f33c-e20f-47ea-9693-dc3304fefd84_2" id=8c3bb895-1c3a-4a6f-a8b3-823950419b98 name=/runtime.v1.RuntimeService/CreateContainer
      Aug 10 20:19:32.730990 ip-10-0-33-221 kubenswrapper[2561]: E0810 20:19:32.730956    2561 log.go:32] "CreateContainer in sandbox from runtime service failed" err=<
      Aug 10 20:19:32.730990 ip-10-0-33-221 kubenswrapper[2561]:         rpc error: code = Unknown desc = container create failed: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
      Aug 10 20:19:32.730990 ip-10-0-33-221 kubenswrapper[2561]:  > podSandboxID="3f467232cfe54b8232419b143ba0f856120633bc841405d8847e64cff7d8b7f2"
      Aug 10 20:19:32.731462 ip-10-0-33-221 kubenswrapper[2561]: E0810 20:19:32.731082    2561 kuberuntime_manager.go:1274] "Unhandled Error" err=<
      Aug 10 20:19:32.731462 ip-10-0-33-221 kubenswrapper[2561]:         init container &Container{Name:toolkit-validation,Image:nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:e183dc07e5889bd9e269c320ffad7f61df655f57ecc3aa158c4929e74528420a,Command:[sh -c],Args:[nvidia-validator],WorkingDir:,Ports:[]ContainerPort{},Env:[]EnvVar{EnvVar{Name:NVIDIA_VISIBLE_DEVICES,Value:all,ValueFrom:nil,},EnvVar{Name:WITH_WAIT,Value:false,ValueFrom:nil,},EnvVar{Name:COMPONENT,Value:toolkit,ValueFrom:nil,},},Resources:ResourceRequirements{Limits:ResourceList{},Requests:ResourceList{},Claims:[]ResourceClaim{},},VolumeMounts:[]VolumeMount{VolumeMount{Name:run-nvidia-validations,ReadOnly:false,MountPath:/run/nvidia/validations,SubPath:,MountPropagation:*Bidirectional,SubPathExpr:,RecursiveReadOnly:nil,},VolumeMount{Name:kube-api-access-rqmsz,ReadOnly:true,MountPath:/var/run/secrets/kubernetes.io/serviceaccount,SubPath:,MountPropagation:nil,SubPathExpr:,RecursiveReadOnly:nil,},},LivenessProbe:nil,ReadinessProbe:nil,Lifecycle:nil,TerminationMessagePath:/dev/termination-log,ImagePullPolicy:IfNotPresent,SecurityContext:&SecurityContext{Capabilities:nil,Privileged:*true,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,ProcMount:nil,WindowsOptions:nil,SeccompProfile:nil,AppArmorProfile:nil,},Stdin:false,StdinOnce:false,TTY:false,EnvFrom:[]EnvFromSource{},TerminationMessagePolicy:File,VolumeDevices:[]VolumeDevice{},StartupProbe:nil,ResizePolicy:[]ContainerResizePolicy{},RestartPolicy:nil,} start failed in pod nvidia-operator-validator-kkpjf_nvidia-gpu-operator(3492f33c-e20f-47ea-9693-dc3304fefd84): CreateContainerError: container create failed: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
      Aug 10 20:19:32.731462 ip-10-0-33-221 kubenswrapper[2561]:  > logger="UnhandledError"
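
      For reference, one way to pull these crio journal entries straight from the worker (node name is a placeholder; oc adm node-logs reads the systemd journal from the node):

      ```
      # dump the crio unit journal from the node and filter for the hook error
      $ oc adm node-logs <gpu-worker-node> -u crio | grep -i 'nvidia-container-runtime-hook'
      ```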
      

       
      Version-Release number of selected component (if applicable):

          4.18.22

      How reproducible:

          All the time

      Steps to Reproduce:

          1. Create a single-node OpenShift (SNO) cluster on AWS using IPI
          2. Add a g4dn.xlarge GPU-enabled machine as a worker node
          3. Deploy the NFD Operator and its operand
          4. Deploy the NVIDIA GPU Operator v25.3.2 (latest) from the certified operators catalog, via the console or CLI
          5. Deploy the clusterpolicy CR with default settings
          6. Check the pods in the nvidia-gpu-operator namespace: oc get pods -n nvidia-gpu-operator
      
      The nvidia-operator-validator pod ends up in Init:CreateContainerError with: error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1). A quick verification sketch follows below.
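
      A minimal verification sketch (the validator pod hash is a placeholder); the hook error is visible in the pod's events:

      ```
      $ oc get pods -n nvidia-gpu-operator
      # the init container failure shows up in the Events section
      $ oc describe pod -n nvidia-gpu-operator nvidia-operator-validator-<hash> | tail -n 20
      ```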
      
      More details are in the upstream GPU Operator issue: https://github.com/NVIDIA/gpu-operator/issues/1598 (also listed under Additional info below).

      Actual results:

          ```
      $ oc get pods -n nvidia-gpu-operator
      NAME                                                  READY   STATUS                      RESTARTS        AGE
      gpu-feature-discovery-q2fln                           0/1     Init:0/1                    0               14m
      gpu-operator-5fcc456c94-wdq8x                         1/1     Running                     0               16m
      nvidia-container-toolkit-daemonset-tc69k              1/1     Running                     0               14m
      nvidia-dcgm-exporter-d2qrj                            0/1     Init:0/2                    0               14m
      nvidia-dcgm-x6b6v                                     0/1     Init:0/1                    0               14m
      nvidia-device-plugin-daemonset-92hr4                  0/1     Init:0/1                    0               14m
      nvidia-driver-daemonset-418.94.202508060022-0-wnrz2   2/2     Running                     0               14m
      nvidia-node-status-exporter-pr2zj                     1/1     Running                     0               14m
      nvidia-operator-validator-wsrpg                       0/1     Init:CreateContainerError   2 (8m14s ago)   14m
      ```

      Expected results:

          ```
      $ oc get pods -n nvidia-gpu-operator
      NAME                                                  READY   STATUS      RESTARTS   AGE
      gpu-feature-discovery-ssx5r                           1/1     Running     0          28h
      gpu-operator-5fcc456c94-wdq8x                         1/1     Running     0          37h
      nvidia-container-toolkit-daemonset-tc69k              1/1     Running     0          37h
      nvidia-cuda-validator-kzfjn                           0/1     Completed   0          2m16s
      nvidia-dcgm-exporter-d2qrj                            1/1     Running     0          37h
      nvidia-dcgm-x6b6v                                     1/1     Running     0          37h
      nvidia-device-plugin-daemonset-92hr4                  1/1     Running     0          37h
      nvidia-driver-daemonset-418.94.202508060022-0-wnrz2   2/2     Running     0          37h
      nvidia-node-status-exporter-pr2zj                     1/1     Running     0          37h
      nvidia-operator-validator-v4j75                       1/1     Running     0          28h
      ```

      Additional info:

          https://github.com/NVIDIA/gpu-operator/issues/1598
