- Bug
- Resolution: Done
- Critical
- 4.12.0
- Important
Description of problem:
The Driver Toolkit (DTK) image on OCP 4.12.0-0.nightly-2022-10-25-210451 has a kernel version mismatch:

WARNING: broken Driver Toolkit image detected:
  - Node kernel: 4.18.0-372.26.1.el8_6.x86_64
  - Kernel package: 4.18.0-372.32.1.el8_6.x86_64

This causes the GPU Operator ClusterPolicy deployment to fail: the GPU drivers cannot be built in the DTK container, and the fallback build path requires cluster-wide entitlement.

# oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-412.86.202210211031-0-jmbqp -c openshift-driver-toolkit-ctr
+ '[' -f /mnt/shared-nvidia-driver-toolkit/dir_prepared ']'
+ exec /mnt/shared-nvidia-driver-toolkit/ocp_dtk_entrypoint dtk-build-driver
Running dtk-build-driver
WARNING: broken Driver Toolkit image detected:
  - Node kernel: 4.18.0-372.26.1.el8_6.x86_64
  - Kernel package: 4.18.0-372.32.1.el8_6.x86_64
INFO: informing nvidia-driver-ctr to fallback on entitled-build.
INFO: nothing else to do in openshift-driver-toolkit-ctr container, sleeping forever.
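For reference, the mismatch can be confirmed by hand with the commands below. This is a minimal sketch: the node name and DTK image pullspec are placeholders, and the last step assumes podman and pull access to the release registry. The first command prints the kernel running on the node, the second resolves the DTK image pinned to the release payload, and the third lists the kernel packages baked into that image.

# oc get node <gpu-node> -o jsonpath='{.status.nodeInfo.kernelVersion}'
# oc adm release info 4.12.0-0.nightly-2022-10-25-210451 --image-for=driver-toolkit
# podman run --rm <dtk-image> rpm -qa 'kernel-core*'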
Version-Release number of selected component (if applicable):
# oc version
Client Version: 4.12.0-0.nightly-2022-09-02-115151
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2022-10-25-210451
Kubernetes Version: v1.25.2+4bd0702
How reproducible:
Every time
Steps to Reproduce:
1. Deploy NFD to label the worker nodes, then deploy the GPU operator from OperatorHub on an OCP cluster with a g4dn.xlarge GPU instance.
2. Deploy a ClusterPolicy instance from OperatorHub (a skeletal example follows these steps).
3. Check the logs of the nvidia-driver daemonset in the nvidia-gpu-operator namespace.
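For step 2, a skeletal ClusterPolicy sketch is shown below; in practice the instance was created from the operator's OperatorHub template, so the name and spec fields here are assumptions and most defaults are elided.

# oc apply -f - <<'EOF'
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  driver:
    enabled: true
  toolkit:
    enabled: true
EOF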
Actual results:
# oc get pods -n nvidia-gpu-operator
NAME                                                  READY   STATUS             RESTARTS        AGE
gpu-feature-discovery-ts8r2                           0/1     Init:0/1           0               25m
gpu-operator-64bcc5c7d-xqr86                          1/1     Running            0               26m
nvidia-container-toolkit-daemonset-nphkg              0/1     Init:0/1           0               25m
nvidia-dcgm-exporter-jr7lf                            0/1     Init:0/2           0               25m
nvidia-dcgm-v246m                                     0/1     Init:0/1           0               25m
nvidia-device-plugin-daemonset-nj26k                  0/1     Init:0/1           0               25m
nvidia-driver-daemonset-412.86.202210211031-0-jmbqp   1/2     CrashLoopBackOff   8 (3m25s ago)   25m
nvidia-node-status-exporter-xzcrs                     1/1     Running            0               25m
nvidia-operator-validator-s2g6c                       0/1     Init:0/4           0               25m
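The crash-looping pod can be inspected with the commands below; the pod name is copied from the output above, and nvidia-driver-ctr is the container the DTK log says falls back to the entitled build.

# oc describe pod -n nvidia-gpu-operator nvidia-driver-daemonset-412.86.202210211031-0-jmbqp
# oc logs -n nvidia-gpu-operator nvidia-driver-daemonset-412.86.202210211031-0-jmbqp -c nvidia-driver-ctr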
Expected results:
Entire GPU stack running and device plugins deployed on the GPU worker node after the GPU driver is built in the DTK container.
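As a sanity check once the driver builds, something like the following should show all pods Running/Completed and the GPU visible from the driver container; the pod name is a placeholder.

# oc get pods -n nvidia-gpu-operator
# oc exec -n nvidia-gpu-operator <nvidia-driver-daemonset-pod> -c nvidia-driver-ctr -- nvidia-smi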
Additional info: