Container Tools / RUN-3446

Impact OCP 4.18.22: NVIDIA GPU Operator v25.3.2 - nvidia-operator-validator pod in Init:CreateContainerError - error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)


    • Type: Spike
    • Resolution: Done
    • Priority: Critical

      Impact statement for OCPBUGS-60663:

      Which 4.y.z to 4.y'.z' updates increase vulnerability?

      Any updates into 4.18.22 or later, until the 4.18.z that ships the fix for OCPBUGS-60663.  4.19 and 4.17 are not affected.

      Which types of clusters?

      Clusters with the NVIDIA GPU Operator installed and using crun on GPU-hosting Nodes. PromQL that returns 1 for "exposed", 0 for "not exposed, and the relevant metrics are working", and no results for "relevant metrics are not working" is:

      group by (name) (csv_succeeded{_id="", name=~"gpu-operator-certified[.].*"})
      or on (_id)
      0 * group(csv_count{_id=""})
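
      To evaluate that query against a single cluster rather than fleet Telemetry, one option is the in-cluster Thanos querier. This is only a sketch: it assumes you are logged in with oc as a user allowed to query cluster monitoring and that the standard thanos-querier Route exists in openshift-monitoring; the _id selectors are written for Telemetry and should simply match the empty label in-cluster.

      # Query the in-cluster Thanos querier (Prometheus-compatible HTTP API).
      # -k skips TLS verification for brevity; drop it if the CA is configured.
      TOKEN="$(oc whoami -t)"
      HOST="$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')"
      curl -sk -H "Authorization: Bearer ${TOKEN}" \
        --data-urlencode 'query=group by (name) (csv_succeeded{_id="", name=~"gpu-operator-certified[.].*"}) or on (_id) 0 * group(csv_count{_id=""})' \
        "https://${HOST}/api/v1/query"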
      

      gpu-operator-certified seems like a surprising ClusterServiceVersion name prefix for "NVIDIA GPU Operator", but that's the name mentioned in these docs, and it's what shows up in this CI run.

      4.18 supports both crun and runc, but I'm not aware of in-cluster PromQL that could distinguish runc from crun Nodes.
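
      That said, you can check a Node's configured default runtime directly from the CLI. A rough sketch; the CRI-O config paths below are the stock defaults and an assumption, so customized clusters may differ:

      # Any ContainerRuntimeConfig pinning the default runtime?
      oc get containerruntimeconfig -o yaml | grep -i defaultRuntime

      # What CRI-O is configured with on a given Node (replace <node-name>).
      oc debug node/<node-name> -- chroot /host \
        sh -c 'grep -r default_runtime /etc/crio/crio.conf /etc/crio/crio.conf.d/ 2>/dev/null'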

      What is the impact?

      The nvidia-operator-validator Pod's container creation can fail, with logs mentioning nvidia-container-runtime-hook. Cluster users cannot run GPU-enabled workloads on impacted clusters.
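
      To confirm whether a given cluster is hitting this, look for the validator Pod stuck in Init:CreateContainerError. A quick check; the nvidia-gpu-operator namespace is the operator's usual install target and is an assumption here:

      oc get pods -n nvidia-gpu-operator | grep nvidia-operator-validator
      # The failing Pod's events should mention the nvidia-container-runtime-hook error.
      oc describe pod -n nvidia-gpu-operator <validator-pod-name>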

      How involved is remediation?

      On an affected release, you can pivot GPU-hosting Nodes from crun to runc.  Alternatively, update to a release with the fix for OCPBUGS-60663, or to a release that was never affected, like 4.19.
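
      One way to do the crun-to-runc pivot is a ContainerRuntimeConfig targeting the MachineConfigPool that holds the GPU Nodes; save a manifest like the sketch below to a file and oc apply -f it. The pool label shown assumes a custom "gpu" pool and is an assumption, so match whatever labels your GPU MachineConfigPool actually carries, and note that applying it rolls (reboots) the targeted Nodes:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: ContainerRuntimeConfig
      metadata:
        name: gpu-nodes-runc
      spec:
        machineConfigPoolSelector:
          matchLabels:
            # Assumed label for a custom "gpu" MachineConfigPool; adjust to your pools.
            pools.operator.machineconfiguration.openshift.io/gpu: ""
        containerRuntimeConfig:
          defaultRuntime: runc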

      If you're already impacted by this, please contact support.

      Is this a regression?

      Yes. 4.18.22 shipped crun 1.23, which began managing device rules with eBPF rather than through cgroups. That is more in line with how systemd expects devices to be managed, but a gap in the SELinux policy caused the new approach to fail. The long-term fix is an update to the container-selinux policy, shipped in version 2.235.0-3 of that package.

      OCP 4.19.11 still ships crun 1.22 (crun-1.22-1.el9_6), so 4.19 was never affected.
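
      To check where a given Node actually stands, compare its installed crun and container-selinux packages against those versions. A minimal sketch:

      # crun 1.23+ paired with container-selinux older than 2.235.0-3 is the
      # exposed combination (when the Node is actually using crun).
      oc debug node/<node-name> -- chroot /host rpm -q crun container-selinux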

       

    • Assignee: Giuseppe Scrivano (gscrivan@redhat.com)
    • Reporter: W. Trevor King (trking)