Type: Bug
Resolution: Unresolved
Priority: Major
Affects Versions: 4.18.z, 4.19.z, 4.20.z
Architectures: x86_64, aarch64
Description of problem:
Workloads that try to consume GPUs via the NVIDIA GPU Operator and have `readOnlyRootFilesystem: true` in their spec fail with one of the following errors:

* If CDI is enabled in the GPU Operator's ClusterPolicy:
```
error executing hook `/usr/local/nvidia/toolkit/nvidia-cdi-hook` (exit code: 1)
```
* If CDI is disabled in the GPU Operator's ClusterPolicy:
```
error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
```

According to a customer whose mutating webhook adds `readOnlyRootFilesystem: true` for compliance reasons, the issue started after the cluster was upgraded from 4.19.17 to 4.19.19.
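Which hook fails depends on the `spec.cdi.enabled` field of the ClusterPolicy. A minimal sketch for checking the current value, assuming the common default instance name `gpu-cluster-policy` (adjust to the name used in your cluster):

```
# Assumption: the ClusterPolicy instance is named gpu-cluster-policy
oc get clusterpolicy gpu-cluster-policy -o jsonpath='{.spec.cdi.enabled}{"\n"}'
```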
Version-Release number of selected component (if applicable):
Known affected OpenShift versions: 4.19.19, 4.18.27, 4.20.3 (with NVIDIA GPU Operator 25.10).
How reproducible:
I could easily reproduce it with CDI enabled on both 4.19.17 and 4.19.19; with CDI disabled, everything works fine in my environment.
Steps to Reproduce:
1. Provision a cluster with an NVIDIA GPU node (e.g. g4dn.2xlarge on AWS).
2. Install the NFD operator and create a NodeFeatureDiscovery CR.
3. Install the NVIDIA GPU Operator 25.10 and create a ClusterPolicy.
4. Run a pod that requests `nvidia.com/gpu: 1` and has `readOnlyRootFilesystem: true` in its securityContext (a minimal snippet follows this list; the full test manifest is in Additional info).
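The relevant fragment of the failing pod spec is sketched below; it is a subset of the full test workload listed under Additional info, and the failure is reported for pods that combine a GPU resource request with `readOnlyRootFilesystem: true`:

```
# Minimal fragment (see Additional info for the complete manifest)
spec:
  containers:
  - name: gpu-burn
    image: quay.io/vemporop/gpu-burn:cuda12.1.0-ubi8
    resources:
      limits:
        nvidia.com/gpu: 1
    securityContext:
      readOnlyRootFilesystem: true
```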
Actual results:
Pod status: CreateContainerError
Events ("oc describe pod" command) show:
Warning Failed 9s (x2 over 10s) kubelet Error: container create failed: error executing hook `/usr/local/nvidia/toolkit/nvidia-cdi-hook` (exit code: 1)
Expected results:
The pod starts successfully
Additional info:
Test workload (based on https://github.com/wilicc/gpu-burn):
```
---
apiVersion: v1
kind: Namespace
metadata:
  name: gpu-burn
---
apiVersion: v1
kind: Pod
metadata:
  name: gpu-burn
  namespace: gpu-burn
spec:
  containers:
  - name: gpu-burn
    image: quay.io/vemporop/gpu-burn:cuda12.1.0-ubi8
    command: ["/app/gpu_burn"]
    args: ["-m", "50%", "999999999"]
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
    securityContext:
      seccompProfile:
        type: RuntimeDefault
      capabilities:
        drop:
        - ALL
      runAsNonRoot: true
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
```
Mitigation:
* Downgrade the NVIDIA GPU Operator to 25.3.
* Explicitly disable CDI in the ClusterPolicy (`spec.cdi.enabled: false`); confirmed in some environments, but might not always work.
* Switch the low-level container runtime from `crun` to `runc`.
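For the CDI mitigation, a minimal sketch of a patch command, assuming the common default ClusterPolicy instance name `gpu-cluster-policy`:

```
# Assumption: ClusterPolicy instance name; verify with "oc get clusterpolicy"
oc patch clusterpolicy gpu-cluster-policy --type merge -p '{"spec":{"cdi":{"enabled":false}}}'
```

For switching the low-level runtime, a hedged sketch of a ContainerRuntimeConfig that sets `runc` as the default runtime for the worker pool. The resource name and pool selector below are illustrative, `defaultRuntime` support should be verified for your OpenShift version, and the Machine Config Operator will roll the change out across the targeted pool:

```
# Illustrative only: name and pool selector are assumptions for a default worker pool
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: worker-runc
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  containerRuntimeConfig:
    defaultRuntime: runc
```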