OpenShift Bugs / OCPBUGS-66249

OCP 4.19.19: GPU workloads with readOnlyRootFilesystem=true fail to run, show NVIDIA hook errors


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Versions: 4.18.z, 4.19.z, 4.20.z
    • Component: Node / CRI-O
    • Architecture: x86_64, aarch64

      Description of problem:

      Workloads that consume GPUs via the NVIDIA GPU Operator and set `readOnlyRootFilesystem: true` in their container securityContext fail with one of the following errors:
      
      * If CDI is enabled in the GPU Operator's ClusterPolicy:
      ```
      error executing hook `/usr/local/nvidia/toolkit/nvidia-cdi-hook` (exit code: 1)
      ```
      
      * If CDI is disabled in the GPU Operator's ClusterPolicy:
      ```
      error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
      ```
      
      According to a customer whose mutating webhook adds `readOnlyRootFilesystem: true` for compliance reasons, the issue started after the cluster was upgraded from 4.19.17 to 4.19.19.

      Version-Release number of selected component (if applicable):

      Known affected OpenShift versions: 4.19.19, 4.18.27, 4.20.3

      How reproducible:

      I could easily reproduce it with CDI enabled on both 4.19.17 and 4.19.19; with CDI disabled, the same workload runs fine.

      Steps to Reproduce:

          1. Provision a cluster with an NVIDIA GPU node (e.g. g4dn.2xlarge on AWS).
          2. Install the NFD operator and create a NodeFeatureDiscovery CR.
          3. Install the NVIDIA GPU Operator 25.10 and create a ClusterPolicy (a fragment with CDI enabled is sketched after this list).
          4. Run a pod that requests "nvidia.com/gpu: 1" and has "readOnlyRootFilesystem: true" in its securityContext.
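
      For step 3, a minimal ClusterPolicy fragment with CDI enabled, as a sketch only: the resource name `gpu-cluster-policy` is an assumption (the default created by the GPU Operator's OLM install), and every field other than the CDI toggle is left at operator defaults.
      ```
      apiVersion: nvidia.com/v1
      kind: ClusterPolicy
      metadata:
        name: gpu-cluster-policy   # assumption: default name from the OLM installation
      spec:
        cdi:
          enabled: true            # CDI on -- the nvidia-cdi-hook error path from the description
      ```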

      Actual results:

      Pod status: CreateContainerError
          
      Events ("oc describe pod" command) show:
      Warning  Failed          9s (x2 over 10s)  kubelet            Error: container create failed: error executing hook `/usr/local/nvidia/toolkit/nvidia-cdi-hook` (exit code: 1)

      Expected results:

      The pod starts successfully

      Additional info:

      Test workload (based on https://github.com/wilicc/gpu-burn):
      
      ---
      apiVersion: v1
      kind: Namespace
      metadata:
        name: gpu-burn
      ---
      apiVersion: v1
      kind: Pod
      metadata:
        name: gpu-burn
        namespace: gpu-burn
      spec:
        containers:
          - name: gpu-burn
            image: quay.io/vemporop/gpu-burn:cuda12.1.0-ubi8
            command: ["/app/gpu_burn"]
            args: ["-m", "50%", "999999999"]
            resources:
              requests:
                nvidia.com/gpu: 1
              limits:
                nvidia.com/gpu: 1
            securityContext:
              seccompProfile:
                type: RuntimeDefault
              capabilities:
                drop:
                  - ALL
              runAsNonRoot: true
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
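
      To run the test, a sketch assuming the manifest above is saved as `gpu-burn.yaml`:
      ```
      oc apply -f gpu-burn.yaml
      oc get pod gpu-burn -n gpu-burn -w
      oc describe pod gpu-burn -n gpu-burn    # on affected versions, events show the hook error
      ```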

      Mitigation:

      * Downgrade the NVIDIA GPU Operator to 25.3.
      * Explicitly disable CDI in the ClusterPolicy (`spec.cdi.enabled: false`); confirmed in some environments, but might not always work.
      * Switch the low-level container runtime from `crun` to `runc` (see the sketch after this list).
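
      A sketch of the `runc` mitigation via a ContainerRuntimeConfig; the name `set-runc-default` and the worker pool selector are assumptions and should be adjusted to the pools that run GPU workloads:
      ```
      apiVersion: machineconfiguration.openshift.io/v1
      kind: ContainerRuntimeConfig
      metadata:
        name: set-runc-default               # assumption: any unique name works
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: ""
        containerRuntimeConfig:
          defaultRuntime: runc               # switch the pool's default runtime from crun back to runc
      ```
      The Machine Config Operator rolls this change out to the selected pool node by node. For the CDI mitigation, the same ClusterPolicy field can be flipped in place, e.g. `oc patch clusterpolicies.nvidia.com gpu-cluster-policy --type merge -p '{"spec":{"cdi":{"enabled":false}}}'` (resource name assumed to be the default).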

              Assignee: Harshal Patil (harpatil@redhat.com)
              Reporter: Vitaliy Emporopulo (rh-ee-vemporop)