Bug
Resolution: Unresolved
Major
4.18.z
Quality / Stability / Reliability
Critical
Node Blue Sprint 280
Describe your problem. Include specific actions and error messages.
We are seeing this issue in a few of our OpenShift clusters running versions 4.18.24 and 4.19.17.
The nvidia-operator-validator pods are stuck in Init:CreateContainerError with the following in the pod events:
```
error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
```
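A minimal sketch of how the failing pods and the events above can be inspected (the `nvidia-gpu-operator` namespace and the `app=nvidia-operator-validator` label selector are assumptions based on a default GPU Operator install; adjust to your environment):
```
# List validator pods stuck in Init:CreateContainerError
# (namespace and label selector are assumptions for a default install)
oc get pods -n nvidia-gpu-operator -l app=nvidia-operator-validator -o wide

# Show the pod events that contain the hook error
oc describe pod -n nvidia-gpu-operator <nvidia-operator-validator-pod>
```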
We also found the following log in the `toolkit-validation` container:
```
time="2025-11-06T13:39:12Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH".
```
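The log above can be pulled from the init container directly, for example (namespace assumed as above, pod name is a placeholder):
```
# Fetch the toolkit-validation init container log from a failing validator pod
oc logs -n nvidia-gpu-operator <nvidia-operator-validator-pod> -c toolkit-validation
```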
I found https://github.com/NVIDIA/gpu-operator/issues/1598, which describes this exact problem but says it should be fixed in 4.18.24; however, we are seeing the same issue on 4.18.24.
I checked `/var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml` on one of the nodes with a failing nvidia-operator-validator pod and found that `no-cgroups = false` was set under `[nvidia-container-cli]`. Updating it to `no-cgroups = true` and restarting the pod fixed the issue on that node.
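A rough sketch of that per-node workaround, assuming node access via `oc debug` (the namespace is an assumption as above, and the edit may be reverted if the container-toolkit pod rewrites the file):
```
# Open a debug shell on the affected node and switch to the host filesystem
oc debug node/<node-name>
chroot /host

# Confirm the current value under [nvidia-container-cli]
grep -n "no-cgroups" /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml

# Flip no-cgroups from false to true
sed -i 's/no-cgroups = false/no-cgroups = true/' \
  /var/usrlocal/nvidia/toolkit/.config/nvidia-container-runtime/config.toml

# Back outside the debug session, delete the failing pod so the DaemonSet recreates it
oc delete pod -n nvidia-gpu-operator <nvidia-operator-validator-pod>
```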
Describe the impact to you or the business
We are currently seeing this in a few of our environments, but I fear it might spread to other environments.
- impacts account: OCPBUGS-60663 OCP 4.18.22: NVIDIA GPU Operator v25.3.2 - nvidia-operator-validator pod in Init:CreateContainerError - error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1) (Closed)
- is blocked by: OCPNODE-3873 Impact statement request for OCPBUGS-64821 OCP 4.18.24: The nvidia-operator-validator pods are in Init:CreateContainerError (To Do)