OpenShift Bugs / OCPBUGS-66249

OCP 4.19.19: GPU workloads with readOnlyRootFilesystem=true fail to run, show NVIDIA hook errors


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Affects Versions: 4.18.z, 4.19.z, 4.20.z
    • Component: Node / CRI-O
    • Architecture: x86_64, aarch64

      Description of problem:

      Workloads that consume GPUs via the NVIDIA GPU Operator and set `readOnlyRootFilesystem: true` in their container securityContext fail with one of the following errors:
      
      * If CDI is enabled in the GPU Operator's ClusterPolicy:
      ```
      error executing hook `/usr/local/nvidia/toolkit/nvidia-cdi-hook` (exit code: 1)
      ```
      
      * If CDI is disabled in the GPU Operator's ClusterPolicy:
      ```
      error executing hook `/usr/local/nvidia/toolkit/nvidia-container-runtime-hook` (exit code: 1)
      ```
      
      According to a customer whose mutating webhook adds `readOnlyRootFilesystem: true` for compliance reasons, the issue started after the cluster was upgraded from 4.19.17 to 4.19.19.

      Version-Release number of selected component (if applicable):

      Known affected OpenShift versions: 4.19.19, 4.18.27, 4.20.3

      How reproducible:

      I could easily reproduce it with CDI enabled on both 4.19.17 and 4.19.19; with CDI disabled, the same workload runs fine.

      Steps to Reproduce:

          1. Provision a cluster with an NVIDIA GPU node (e.g. g4dn.2xlarge on AWS).
          2. Install the NFD operator and create a NodeFeatureDiscovery CR.
          3. Install the NVIDIA GPU Operator 25.10 and create a ClusterPolicy (a fragment with CDI enabled is sketched after this list).
          4. Run a pod that requests "nvidia.com/gpu: 1" and has "readOnlyRootFilesystem: true" in its securityContext.
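
      For step 3, a minimal ClusterPolicy fragment with CDI enabled, as a sketch only: the resource name `gpu-cluster-policy` is an assumption (the default created by the GPU Operator's OLM install), and every field other than the CDI toggle is left at operator defaults.
      ```
      apiVersion: nvidia.com/v1
      kind: ClusterPolicy
      metadata:
        name: gpu-cluster-policy   # assumption: default name from the OLM installation
      spec:
        cdi:
          enabled: true            # CDI on -- the nvidia-cdi-hook error path from the description
      ```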

      Actual results:

      Pod status: CreateContainerError
          
      Events ("oc describe pod" command) show:
      Warning  Failed          9s (x2 over 10s)  kubelet            Error: container create failed: error executing hook `/usr/local/nvidia/toolkit/nvidia-cdi-hook` (exit code: 1)

      Expected results:

      The pod starts successfully

      Additional info:

      Test workload (based on https://github.com/wilicc/gpu-burn):
      
      ---
      apiVersion: v1
      kind: Namespace
      metadata:
        name: gpu-burn
      ---
      apiVersion: v1
      kind: Pod
      metadata:
        name: gpu-burn
        namespace: gpu-burn
      spec:
        containers:
          - name: gpu-burn
            image: quay.io/vemporop/gpu-burn:cuda12.1.0-ubi8
            command: ["/app/gpu_burn"]
            args: ["-m", "50%", "999999999"]
            resources:
              requests:
                nvidia.com/gpu: 1
              limits:
                nvidia.com/gpu: 1
            securityContext:
              seccompProfile:
                type: RuntimeDefault
              capabilities:
                drop:
                  - ALL
              runAsNonRoot: true
              allowPrivilegeEscalation: false
              readOnlyRootFilesystem: true
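
      To run the test, a sketch assuming the manifest above is saved as `gpu-burn.yaml`:
      ```
      oc apply -f gpu-burn.yaml
      oc get pod gpu-burn -n gpu-burn -w
      oc describe pod gpu-burn -n gpu-burn    # on affected versions, events show the hook error
      ```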

      Mitigation:

      * Downgrade the NVIDIA GPU Operator to 25.3.
      * Explicitly disable CDI in the ClusterPolicy (`spec.cdi.enabled: false`); confirmed in some environments, but might not always work.
      * Switch the low-level container runtime from `crun` to `runc` (see the sketch after this list).
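
      A sketch of the `runc` mitigation via a ContainerRuntimeConfig; the name `set-runc-default` and the worker pool selector are assumptions and should be adjusted to the pools that run GPU workloads:
      ```
      apiVersion: machineconfiguration.openshift.io/v1
      kind: ContainerRuntimeConfig
      metadata:
        name: set-runc-default               # assumption: any unique name works
      spec:
        machineConfigPoolSelector:
          matchLabels:
            pools.operator.machineconfiguration.openshift.io/worker: ""
        containerRuntimeConfig:
          defaultRuntime: runc               # switch the pool's default runtime from crun back to runc
      ```
      The Machine Config Operator rolls this change out to the selected pool node by node. For the CDI mitigation, the same ClusterPolicy field can be flipped in place, e.g. `oc patch clusterpolicies.nvidia.com gpu-cluster-policy --type merge -p '{"spec":{"cdi":{"enabled":false}}}'` (resource name assumed to be the default).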

              Assignee: Harshal Patil (harpatil@redhat.com)
              Reporter: Vitaliy Emporopulo (rh-ee-vemporop)