OpenShift Bugs / OCPBUGS-2998

OCP 4.12 Driver Toolkit (DTK) mismatch in kernel package and node kernel versions


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Affects Version: 4.12.0
    • Component: Driver Toolkit
    • Severity: Important

      Description of problem:

      Driver Toolkit (DTK) image on OCP 4.12.0-0.nightly-2022-10-25-210451 has the following kernel version mismatch:
      
      WARNING: broken Driver Toolkit image detected: 
      - Node kernel: 4.18.0-372.26.1.el8_6.x86_64 
      - Kernel package: 4.18.0-372.32.1.el8_6.x86_64
      
      This causes the GPU Operator ClusterPolicy deployment to fail, as the GPU drivers cannot be built in the DTK container and instead require a cluster-wide entitlement.
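
      The mismatch can be confirmed independently of the GPU operator by comparing the node kernel with the kernel packages shipped in the DTK image. A minimal sketch, assuming the node name placeholder is replaced with an actual GPU worker and that the cluster pull secret allows running the release's DTK image directly:

      oc debug node/<gpu-worker-node> -- chroot /host uname -r
      DTK_IMAGE=$(oc adm release info --image-for=driver-toolkit)
      oc run dtk-kernel-check --rm -it --restart=Never --image="$DTK_IMAGE" --command -- rpm -qa 'kernel-core*'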
      
      # oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-412.86.202210211031-0-jmbqp -c openshift-driver-toolkit-ctr
      + '[' -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ']'
      + exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
      Running dtk-build-driver
      WARNING: broken Driver Toolkit image detected:
      - Node kernel:    4.18.0-372.26.1.el8_6.x86_64
      - Kernel package: 4.18.0-372.32.1.el8_6.x86_64
      INFO: informing nvidia-driver-ctr to fallback on entitled-build.
      INFO: nothing else to do in openshift-driver-toolkit-ctr container, sleeping forever.
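
      The daemonset name suggests the DTK image was resolved from the 412.86.202210211031-0 tag of the in-cluster driver-toolkit imagestream (an assumption about how the GPU operator picks the image); the tags and the image digests they point at can be listed with:

      oc get imagestream driver-toolkit -n openshift -o jsonpath='{range .spec.tags[*]}{.name}{"\t"}{.from.name}{"\n"}{end}'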

      Version-Release number of selected component (if applicable):

      # oc version

      Client Version: 4.12.0-0.nightly-2022-09-02-115151

      Kustomize Version: v4.5.4

      Server Version: 4.12.0-0.nightly-2022-10-25-210451

      Kubernetes Version: v1.25.2+4bd0702
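
      For cross-checking, the kernel reported by the kubelet on each node can be listed directly; the GPU worker should show 4.18.0-372.26.1.el8_6.x86_64, matching the "Node kernel" line in the warning above:

      oc get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion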

      How reproducible:

      Every time

      Steps to Reproduce:

      1. Deploy the GPU operator from OperatorHub on an OCP cluster with a g4dn.xlarge GPU worker, after deploying NFD to label the worker nodes (see the label check after this list)
      2. Deploy a ClusterPolicy instance from OperatorHub
      3. Check the logs of the NVIDIA driver daemonset in the nvidia-gpu-operator namespace
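
      Before step 2, NFD should have applied the NVIDIA PCI vendor label (vendor ID 10de) to the GPU worker; a quick check of step 1, assuming NFD's default label set:

      oc get nodes -l feature.node.kubernetes.io/pci-10de.present=true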
      

      Actual results:

      # oc get pods -n nvidia-gpu-operator
      NAME                                                  READY   STATUS             RESTARTS        AGE
      gpu-feature-discovery-ts8r2                           0/1     Init:0/1           0               25m
      gpu-operator-64bcc5c7d-xqr86                          1/1     Running            0               26m
      nvidia-container-toolkit-daemonset-nphkg              0/1     Init:0/1           0               25m
      nvidia-dcgm-exporter-jr7lf                            0/1     Init:0/2           0               25m
      nvidia-dcgm-v246m                                     0/1     Init:0/1           0               25m
      nvidia-device-plugin-daemonset-nj26k                  0/1     Init:0/1           0               25m
      nvidia-driver-daemonset-412.86.202210211031-0-jmbqp   1/2     CrashLoopBackOff   8 (3m25s ago)   25m
      nvidia-node-status-exporter-xzcrs                     1/1     Running            0               25m
      nvidia-operator-validator-s2g6c                       0/1     Init:0/4           0               25m
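
      The CrashLoopBackOff is presumably in the companion nvidia-driver-ctr container (the openshift-driver-toolkit-ctr reports it is sleeping forever after the fallback). Its logs and the pod events can be pulled from the same pod:

      oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-412.86.202210211031-0-jmbqp -c nvidia-driver-ctr
      oc describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-412.86.202210211031-0-jmbqp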

      Expected results:

      Entire GPU stack running and device plugins deployed on the GPU worker node after the GPU driver is built in the DTK container.
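
      A sketch of how the expected state could be verified once the driver builds in the DTK container (the daemonset name is inferred from the pod name above):

      oc get pods -n nvidia-gpu-operator
      oc exec -n nvidia-gpu-operator ds/nvidia-driver-daemonset-412.86.202210211031-0 -c nvidia-driver-ctr -- nvidia-smi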

      Additional info:

       

            Assignee: Yoni Bettan (ybettan@redhat.com)
            Reporter: Walid Abouhamad (walid@redhat.com)
            QA Contact: Lital Alon