OCPBUGS-64821: OCP 4.18.24: The nvidia-operator-validator pods are in Init:CreateContainerError


    • Quality / Stability / Reliability
    • Critical
    • Node Blue Sprint 280

      Describe your problem. Include specific actions and error messages.
      We are seeing this issue in a few of our OpenShift clusters running versions 4.18.24 and 4.19.17.

      The nvidia-operator-validator pods are stuck in Init:CreateContainerError, with the following error in the pod events:
      ```
      error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
      ```

      We also found the following log line in the `toolkit-validation` container:
      ```
      time="2025-11-06T13:39:12Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH".
      ```

      The upstream issue https://github.com/NVIDIA/gpu-operator/issues/1598 describes this same failure and says it should be fixed in 4.18.24, but we are still seeing it on 4.18.24.
      I checked `/var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml` on one of the nodes with a failing nvidia-operator-validator pod and found that `no-cgroups = false` was set under `[nvidia-container-cli]`. Updating it to `no-cgroups = true` and restarting the pod fixed the issue on that node.

      Describe the impact to you or the business
      We are currently seeing this in a few of our environments, and we are concerned that it may spread to others.

              svanka@redhat.com Sai Ramesh Vanka
              rhn-support-nchoudhu Novonil Choudhuri