Container Tools / RUN-3446

Impact OCP 4.18.22: NVIDIA GPU Operator v25.3.2 - nvidia-operator-validator pod in Init:CreateContainerError - error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)


    • Type: Spike
    • Resolution: Done
    • Priority: Critical

      Impact statement for OCPBUGS-60663:

      Which 4.y.z to 4.y'.z' updates increase vulnerability?

      Any updates into 4.18.22 or later, until the 4.18.z that ships the fix for OCPBUGS-60663.  4.19 and 4.17 are not affected.

      Which types of clusters?

      Clusters with the NVIDIA GPU Operator installed and using crun on GPU-hosting Nodes. PromQL that returns 1 for "exposed", 0 for "not exposed, and the relevant metrics are working", and no results for "relevant metrics are not working" is:

      group by (name) (csv_succeeded{_id="", name=~"gpu-operator-certified[.].*"})
      or on (_id)
      0 * group(csv_count{_id=""})
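
      To evaluate that query against a single cluster rather than fleet Telemetry, one option is the in-cluster Thanos querier. This is only a sketch: it assumes you are logged in with oc as a user allowed to query cluster monitoring and that the standard thanos-querier Route exists in openshift-monitoring; the _id selectors are written for Telemetry and should simply match the empty label in-cluster.

      # Query the in-cluster Thanos querier (Prometheus-compatible HTTP API).
      # -k skips TLS verification for brevity; drop it if the CA is configured.
      TOKEN="$(oc whoami -t)"
      HOST="$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')"
      curl -sk -H "Authorization: Bearer ${TOKEN}" \
        --data-urlencode 'query=group by (name) (csv_succeeded{_id="", name=~"gpu-operator-certified[.].*"}) or on (_id) 0 * group(csv_count{_id=""})' \
        "https://${HOST}/api/v1/query"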
      

      gpu-operator-certified seems like a surprising ClusterServiceVersion name prefix for "NVIDIA GPU Operator", but that's the name mentioned in these docs, and it's what shows up in this CI run.

      4.18 supports both crun and runc, but I'm not aware of in-cluster PromQL that could distinguish runc from crun Nodes.
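
      That said, you can check a Node's configured default runtime directly from the CLI. A rough sketch; the CRI-O config paths below are the stock defaults and an assumption, so customized clusters may differ:

      # Any ContainerRuntimeConfig pinning the default runtime?
      oc get containerruntimeconfig -o yaml | grep -i defaultRuntime

      # What CRI-O is configured with on a given Node (replace <node-name>).
      oc debug node/<node-name> -- chroot /host \
        sh -c 'grep -r default_runtime /etc/crio/crio.conf /etc/crio/crio.conf.d/ 2>/dev/null'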

      What is the impact?

      The nvidia-operator-validator Pod's container creation can fail, with logs mentioning nvidia-container-runtime-hook. Cluster users cannot run GPU-enabled workloads on impacted clusters.
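
      To confirm whether a given cluster is hitting this, look for the validator Pod stuck in Init:CreateContainerError. A quick check; the nvidia-gpu-operator namespace is the operator's usual install target and is an assumption here:

      oc get pods -n nvidia-gpu-operator | grep nvidia-operator-validator
      # The failing Pod's events should mention the nvidia-container-runtime-hook error.
      oc describe pod -n nvidia-gpu-operator <validator-pod-name>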

      How involved is remediation?

      On an affected release, you can pivot GPU-hosting Nodes from crun to runc.  Alternatively, update to a release with the fix for OCPBUGS-60663, or to a release that was never affected, like 4.19.
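
      One way to do the crun-to-runc pivot is a ContainerRuntimeConfig targeting the MachineConfigPool that holds the GPU Nodes; save a manifest like the sketch below to a file and oc apply -f it. The pool label shown assumes a custom "gpu" pool and is an assumption, so match whatever labels your GPU MachineConfigPool actually carries, and note that applying it rolls (reboots) the targeted Nodes:

      apiVersion: machineconfiguration.openshift.io/v1
      kind: ContainerRuntimeConfig
      metadata:
        name: gpu-nodes-runc
      spec:
        machineConfigPoolSelector:
          matchLabels:
            # Assumed label for a custom "gpu" MachineConfigPool; adjust to your pools.
            pools.operator.machineconfiguration.openshift.io/gpu: ""
        containerRuntimeConfig:
          defaultRuntime: runc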

      If you're already impacted by this, please contact support.

      Is this a regression?

      Yes. 4.18.22 shipped crun 1.23, which began managing device rules with eBPF rather than through cgroups. That is more in line with how systemd expects devices to be managed, but a gap in the SELinux policy caused the new approach to fail. The long-term fix is an update to the container-selinux policy, shipped in version 2.235.0-3 of that package.

      OCP 4.19.11 still ships crun 1.22 (crun-1.22-1.el9_6), so 4.19 was never affected.
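
      To check where a given Node actually stands, compare its installed crun and container-selinux packages against those versions. A minimal sketch:

      # crun 1.23+ paired with container-selinux older than 2.235.0-3 is the
      # exposed combination (when the Node is actually using crun).
      oc debug node/<node-name> -- chroot /host rpm -q crun container-selinux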

       

    • Assignee: Giuseppe Scrivano (gscrivan@redhat.com)
    • Reporter: W. Trevor King (trking)