OpenShift Bugs / OCPBUGS-10980

GPU Operator fails to deploy due to broken DTK

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • None
    • 4.13.0
    • Driver Toolkit
    • Critical
    • Yes
    • False

      Description of problem:

      When deploying the GPU Operator on OCP 4.13.0-rc1 on bare metal (BM), the GPU Operator pod
      nvidia-driver-daemonset-413.92.202303190222-0-fwmcq throws the warning below, which causes the nvidia-driver deployment to fail:
      
      WARNING: broken driver toolkit detected, using entitlement-based fallback
      
      [root@openshift-qe-018 ~]# oc get pods -n nvidia-gpu-operator
      E0328 10:30:48.787723 2435120 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
      NAME                                                  READY   STATUS             RESTARTS          AGE
      gpu-feature-discovery-8b5lt                           0/1     Init:0/1           0                 26h
      gpu-operator-d75d4dcb5-85sg5                          1/1     Running            0                 28h
      nvidia-container-toolkit-daemonset-bn7d2              0/1     Init:0/1           0                 26h
      nvidia-dcgm-exporter-fhs26                            0/1     Init:0/2           0                 26h
      nvidia-device-plugin-daemonset-zplqh                  0/1     Init:0/1           0                 26h
      nvidia-driver-daemonset-413.92.202303190222-0-fwmcq   1/2     CrashLoopBackOff   275 (2m27s ago)   28h
      nvidia-operator-validator-xwtn8                       0/1     Init:0/4           0                 26h
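
A minimal sketch for triaging a listing like the one above: it filters `oc get pods` output down to the pods whose READY count is incomplete. The function name `not_ready_pods` is hypothetical and not part of the GPU Operator or `oc`; it only assumes the default column layout (NAME, READY, STATUS) shown in this report.

```shell
# Hypothetical helper (not part of oc or the GPU Operator).
# Reads `oc get pods` output on stdin and prints NAME, READY and STATUS
# for every pod whose ready-container count is below its total.
# Header and log lines are skipped because their 2nd field is not "N/M".
not_ready_pods() {
  awk '$2 ~ /^[0-9]+\/[0-9]+$/ {
         split($2, r, "/")
         if (r[1] + 0 != r[2] + 0 && $3 != "Completed") print $1, $2, $3
       }'
}

# Intended usage (requires a logged-in cluster session):
#   oc get pods -n nvidia-gpu-operator | not_ready_pods
```

For a pod reported this way, `oc logs -n nvidia-gpu-operator <pod-name> --all-containers` and `oc describe pod -n nvidia-gpu-operator <pod-name>` are the usual next steps to find the container that emitted the warning.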

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      1. Set up OCP 4.13.0-rc1
      2. Deploy the GPU Operator
      

      Actual results:

      The GPU Operator fails to deploy.

      Expected results:

      The GPU Operator deploys successfully.

      Additional info:

       

            ybettan@redhat.com Yoni Bettan
            rhn-support-liqcui Liquan Cui
            Lital Alon