Spike
Resolution: Done
Critical
None
-
None
-
None
-
3
-
False
-
-
False
-
-
Impact statement for OCPBUGS-60663:
Which 4.y.z to 4.y'.z' updates increase vulnerability?
Any updates into 4.18.22 or later, until the 4.18.z that ships the fix for OCPBUGS-60663. 4.19 and 4.17 are not affected.
Which types of clusters?
Clusters with the NVIDIA GPU Operator installed and using crun on GPU-hosting Nodes. PromQL that returns 1 for "exposed", 0 for "not exposed (and the relevant metrics are working)", and no results when the relevant metrics are not working is:
group by (name) (csv_succeeded{_id="", name=~"gpu-operator-certified[.].*"}) or on (_id) 0 * group(csv_count{_id=""})
gpu-operator-certified seems like a surprising ClusterServiceVersion name prefix for "NVIDIA GPU Operator", but that's the name mentioned in these docs, and it's what shows up in this CI run.
4.18 supports both crun and runc, but I'm not aware of in-cluster PromQL that could distinguish runc from crun Nodes.
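If it helps, here is a minimal sketch of one way to run that query against the in-cluster monitoring stack, assuming a token-based oc login with permission to query cluster monitoring; the thanos-querier route and the /api/v1/query path are the standard Prometheus API ones:

# Run the exposure query against the cluster's Thanos querier route.
TOKEN=$(oc whoami -t)
HOST=$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')
curl -sk -H "Authorization: Bearer $TOKEN" \
  --data-urlencode 'query=group by (name) (csv_succeeded{_id="", name=~"gpu-operator-certified[.].*"}) or on (_id) 0 * group(csv_count{_id=""})' \
  "https://$HOST/api/v1/query"
# A result whose value is 1 means "exposed", 0 means "not exposed",
# and an empty result means the relevant metrics are not being reported.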
What is the impact?
The nvidia-operator-validator Pod's container creation can fail, with logs mentioning nvidia-container-runtime-hook. Cluster users cannot run GPU-enabled workloads on impacted clusters.
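To check whether a cluster is already hitting this, something like the following should work; the namespace below is the NVIDIA GPU Operator's usual one and is an assumption, so adjust it to your install:

# Look for the validator pod stuck in Init:CreateContainerError,
# with events mentioning nvidia-container-runtime-hook.
oc get pods -n nvidia-gpu-operator | grep nvidia-operator-validator
oc get events -n nvidia-gpu-operator | grep nvidia-container-runtime-hook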
How involved is remediation?
On an affected release, you can pivot GPU-hosting Nodes from crun to runc (one approach is sketched below). Alternatively, update to a release with the fix for OCPBUGS-60663, or to a release that was never affected, like 4.19.
If you're already impacted by this, please contact support.
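For the crun-to-runc pivot, here is a minimal sketch, assuming the GPU-hosting Nodes live in their own MachineConfigPool; the ContainerRuntimeConfig name and the pool label are placeholders, so match them to your cluster:

cat <<'EOF' | oc apply -f -
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: gpu-pool-runc
spec:
  machineConfigPoolSelector:
    matchLabels:
      # Placeholder: use the label carried by the pool that holds the GPU-hosting Nodes.
      pools.operator.machineconfiguration.openshift.io/gpu: ""
  containerRuntimeConfig:
    # Switch CRI-O's default OCI runtime from crun back to runc on that pool.
    defaultRuntime: runc
EOF

The Machine Config Operator rolls this change out across the selected pool, so expect some Node disruption while it applies.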
Is this a regression?
Yes. 4.18.22 shipped crun 1.23, which began managing device rules with eBPF rather than through cgroups. This is more in line with how systemd expects devices to be managed. However, a gap in the SELinux policy caused this management to fail. The long-term fix is an update to the container-selinux policy, shipped in version 2.235.0-3 of that package.
OCP 4.19 is still using crun 1.22 (4.19.11 ships crun-1.22-1.el9_6, for example), so it was never affected.
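To confirm what a given GPU-hosting Node is actually running, something like the following should work (the node name is a placeholder):

# Query the installed crun and container-selinux package versions on the Node.
oc debug node/<gpu-node-name> -- chroot /host rpm -q crun container-selinux
# crun 1.23 together with container-selinux older than 2.235.0-3 matches the affected combination.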
blocks: OCPBUGS-60663 OCP 4.18.22: NVIDIA GPU Operator v25.3.2 - nvidia-operator-validator pod in Init:CreateContainerError - error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1) (Closed)