RHEL-148722

nvidia-container-runtime is unable to run eBPF programs on newer kernel versions


    • Bug
    • Resolution: Unresolved
    • Normal
    • None
    • rhel-9.6
    • container-selinux
    • None
    • None
    • Moderate
    • rhel-container-tools
    • 3
    • False
    • False
    • None
    • None
    • None
    • None
    • None
    • Unspecified
    • Unspecified
    • Unspecified
    • x86_64
    • None

      What were you trying to do that didn't work?

      We are trying to install the GPU Operator with CDI disabled on a Kubernetes node running RHEL 9.

      We are on a RHEL 9.6 system with the kernel version "5.14.0-570.12.1.el9_6.x86_64". We are using Kubernetes and CRI-O version v1.33. We started observing these failures after upgrading to newer kernel versions, so we believe this is a regression.

      The last working RHEL 9.6 kernel version was "5.14.0-570.12.1.el9_6.x86_64". We have also observed that changing the versions of Kubernetes, CRI-O, crun and container-selinux did not have any effect. The kernel version seems to be the most significant factor here.
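
Since the report points at the kernel release as the deciding factor, a small helper for ordering RHEL kernel release strings may help when checking which nodes are past the last known-good build. This is only a sketch: the function names `parse_kernel_release` and `is_newer` are hypothetical, the regular expression assumes the `X.Y.Z-build.elN_M.arch` layout seen in this report, and the baseline is simply the version quoted above.

```python
import platform
import re

# Baseline taken verbatim from the report (last kernel stated as working).
LAST_KNOWN_GOOD = "5.14.0-570.12.1.el9_6.x86_64"

# Assumed layout: <major>.<minor>.<patch>-<build parts>.el<N>[_<M>].<arch>
_RELEASE_RE = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)-([\d.]+)\.el(\d+)(?:_(\d+))?\.(\w+)$"
)


def parse_kernel_release(release: str) -> tuple:
    """Return the numeric version and build fields as a tuple of ints."""
    match = _RELEASE_RE.match(release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch, build = match.group(1, 2, 3, 4)
    build_parts = [int(part) for part in build.split(".")]
    return (int(major), int(minor), int(patch), *build_parts)


def is_newer(candidate: str, baseline: str = LAST_KNOWN_GOOD) -> bool:
    """True if candidate sorts strictly after baseline."""
    return parse_kernel_release(candidate) > parse_kernel_release(baseline)


if __name__ == "__main__":
    running = platform.release()
    try:
        status = "newer than" if is_newer(running) else "not newer than"
        print(f"{running} is {status} {LAST_KNOWN_GOOD}")
    except ValueError as err:
        # Non-RHEL kernels will not match the assumed layout.
        print(err)
```

Run on each node, this prints whether the running kernel sorts after the baseline. It says nothing about which kernel change caused the regression.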

      What is the impact of this issue to you?

      The GPU Operator in non-CDI mode is unusable since upgrading the RHEL 9.6 kernels. This will affect all users of the GPU Operator on RHEL 9.

      Please provide the package NVR for which the bug is seen:

      How reproducible is this bug?:

      Steps to reproduce

      1. Set up a Kubernetes Cluster
      2. Install Helm
      3. Install GPU Operator with CDI disabled
        (append "--set cdi.enabled=false" to the helm install command referenced in the link)

      Expected results

      All of the GPU Operator pods come up with no issues

      Actual results

       

      The nvidia-operator-validator pod fails to come up and its status is reported as "Init:CreateContainerError". The toolkit-validation container goes into CrashLoopBackOff with the following error:

      time="2026-02-11T21:58:57Z" level=info msg="version: 84601875-amd64, commit: 8460187"                                                                                                                                                           
      toolkit is not ready                                                                                                                                                                                                                            
      time="2026-02-11T21:58:57Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH" 

      I have also attached the strace logs, which provide more details on the exact failure.
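
The `executable file not found in $PATH` message in the log above has the form of a plain PATH lookup failure, so one quick check is to repeat that lookup by hand from inside the failing container's environment. The sketch below uses Python's `shutil.which` for this. It is not the validator's own code, the helper names `find_on_path` and `report` are hypothetical, and the result only means something when run with the same PATH and mounts as the toolkit-validation container.

```python
import os
import shutil


def find_on_path(name, path=None):
    """Return the full path of an executable, or None if the lookup fails.

    With path=None the current process's PATH is searched.
    """
    return shutil.which(name, path=path)


def report(name="nvidia-smi"):
    """Describe the lookup result in one line."""
    location = find_on_path(name)
    if location is None:
        searched = os.environ.get("PATH", "")
        return f"{name}: not found (PATH={searched})"
    return f"{name}: found at {location}"


if __name__ == "__main__":
    print(report())
```

If the binary is reported as missing here as well, the runtime likely did not inject the driver files into the container, which would match the validator error rather than point at the validator itself.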

              pehunt@redhat.com Peter Hunt
              tariq.ibrahim Tariq Ibrahim
              Container Runtime Eng Bot Container Runtime Eng Bot
              Container Runtime Bugs Bot Container Runtime Bugs Bot