OpenShift Bugs / OCPBUGS-14074

Pods with GPU requests are failing to get scheduled because the autoscaler is not scaling up the matching nodes


Details

    • Important
    • No
    • CLOUD Sprint 239, CLOUD Sprint 240, CLOUD Sprint 241, CLOUD Sprint 242, CLOUD Sprint 243, CLOUD Sprint 244
    • 6
    • Rejected
    • False
    • Customer Escalated

    Description

      Description of problem:

      A GPU-enabled pod is failing to trigger a node scale-up through the cluster autoscaler.
      The following messages appear in the cluster autoscaler pod logs:
      
      2023-05-23T08:49:52.701980777Z I0523 08:49:52.701962       1 non_csi.go:241] "Could not get a CSINode object for the node" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392" err="csinode.storage.k8s.io \"template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392\" not found"
      2023-05-23T08:49:52.702048033Z I0523 08:49:52.702025       1 csi.go:99] "Could not get a CSINode object for the node" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392" err="csinode.storage.k8s.io \"template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392\" not found"
      2023-05-23T08:49:52.702079470Z I0523 08:49:52.702073       1 binder.go:257] "FindPodVolumes" pod="aayyad-jupyter/jupyter-nb-vwfjtc4-0" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392"
      2023-05-23T08:49:52.702115215Z I0523 08:49:52.702092       1 binder.go:782] "Could not get a CSINode object for the node" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392" err="csinode.storage.k8s.io \"template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392\" not found"
      2023-05-23T08:49:52.702200047Z I0523 08:49:52.702178       1 binder.go:802] "PersistentVolume and node mismatch for pod" PV="pvc-6292e6b5-d261-4b3a-a327-88d12150edbd" node="template-node-for-MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b-5679343275053522392" pod="aayyad-jupyter/jupyter-nb-vwfjtc4-0" err="no matching NodeSelectorTerms"
      2023-05-23T08:49:52.702223300Z I0523 08:49:52.702206       1 scale_up.go:299] Pod jupyter-nb-vwfjtc4-0 can't be scheduled on MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b, predicate checking error: node(s) had volume node affinity conflict; predicateName=VolumeBinding; reasons: node(s) had volume node affinity conflict; debugInfo=
      2023-05-23T08:49:52.702234853Z I0523 08:49:52.702226       1 scale_up.go:458] No pod can fit to MachineSet/openshift-machine-api/ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b
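
      The "Could not get a CSINode object" messages are informational, but the "PersistentVolume and node mismatch ... no matching NodeSelectorTerms" and "volume node affinity conflict" lines show why the scale-up is skipped: the autoscaler simulated a template node for the ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b MachineSet and rejected it because the already-bound PV pvc-6292e6b5-d261-4b3a-a327-88d12150edbd is pinned by its nodeAffinity to a zone that this MachineSet does not serve. A minimal way to compare the two, assuming an AWS providerSpec and a topology.kubernetes.io/zone term on the PV (illustrative commands, not taken from the must-gather):

      # Zone(s) the bound PV is restricted to (written into nodeAffinity at provisioning time)
      $ oc get pv pvc-6292e6b5-d261-4b3a-a327-88d12150edbd \
          -o jsonpath='{.spec.nodeAffinity.required.nodeSelectorTerms}{"\n"}'

      # Availability zone the GPU MachineSet would create nodes in
      $ oc get machineset ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b -n openshift-machine-api \
          -o jsonpath='{.spec.template.spec.providerSpec.value.placement.availabilityZone}{"\n"}'

      If the zones differ, the autoscaler will keep rejecting this MachineSet for the pod regardless of GPU capacity, because the pod's volume can only be attached in the PV's zone.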

      Version-Release number of selected component (if applicable):

      OCP version: 4.12.16

      Actual results:

      Node scale-up is not triggered as expected and the pod remains stuck in the Pending state:
      $ omg get pods
      NAME                               READY  STATUS   RESTARTS  AGE
      fast-api-minimal-844f4c99f4-wst5h  1/1    Running  0         1d
      jupyter-nb-vwfjtc4-0               0/2    Pending  0         29m <<---
      
      Events:
      23m        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  pod didn't trigger scale-up: 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 1 node(s) had no available volume zone, 3 node(s) had volume node affinity conflict, 13 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {nodeforlogging: }
      22m        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  pod didn't trigger scale-up: 2 node(s) had untolerated taint {nodeforlogging: }, 12 node(s) didn't match Pod's node affinity/selector, 2 Insufficient memory, 1 Insufficient nvidia.com/gpu, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had volume node affinity conflict
      20m        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  pod didn't trigger scale-up: 2 node(s) had volume node affinity conflict, 13 node(s) didn't match Pod's node affinity/selector, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had untolerated taint {nodeforlogging: }, 1 node(s) had no available volume zone
      19m        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  pod didn't trigger scale-up: 1 node(s) had untolerated taint {nodeforlogging: }, 14 node(s) didn't match Pod's node affinity/selector, 1 Insufficient nvidia.com/gpu, 1 node(s) had no available volume zone, 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had volume node affinity conflict
      18m        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  pod didn't trigger scale-up: 1 Insufficient memory, 12 node(s) didn't match Pod's node affinity/selector, 3 node(s) had volume node affinity conflict, 2 node(s) had untolerated taint {nodeforlogging: }, 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 1 node(s) had no available volume zone
      16m        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  pod didn't trigger scale-up: 9 node(s) didn't match Pod's node affinity/selector, 2 Insufficient memory, 2 Insufficient nvidia.com/gpu, 1 node(s) had no available volume zone, 5 node(s) had volume node affinity conflict, 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had untolerated taint {nodeforlogging: }
      18s        Normal   NotTriggerScaleUp     pod/jupyter-nb-vwfjtc4-0                  (combined from similar events): pod didn't trigger scale-up: 12 node(s) didn't match Pod's node affinity/selector, 1 node(s) had untolerated taint {nodeforlogging: }, 2 Insufficient nvidia.com/gpu, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had volume node affinity conflict, 1 Insufficient memory
      29m        Normal   SuccessfulCreate      statefulset/jupyter-nb-vwfjtc4            create Pod jupyter-nb-vwfjtc4-0 in StatefulSet jupyter-nb-vwfjtc4 successful
      7m13s      Warning  FailedScheduling      notebook/jupyter-nb-vwfjtc4               Reissued from pod/jupyter-nb-vwfjtc4-0: 0/15 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 3 node(s) had untolerated taint {nodeforlogging: }, 6 node(s) didn't match Pod's node affinity/selector. preemption: 0/15 nodes are available: 15 Preemption is not helpful for scheduling.
      29m        Normal   SuccessfulCreate      notebook/jupyter-nb-vwfjtc4               Reissued from statefulset/jupyter-nb-vwfjtc4: create Pod jupyter-nb-vwfjtc4-0 in StatefulSet jupyter-nb-vwfjtc4 successful
      27m        Normal   NotTriggerScaleUp     notebook/jupyter-nb-vwfjtc4               Reissued from pod/jupyter-nb-vwfjtc4-0: pod didn't trigger scale-up: 1 Insufficient nvidia.com/gpu, 1 node(s) had no available volume zone, 10 node(s) didn't match Pod's node affinity/selector, 3 node(s) had volume node affinity conflict, 1 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 2 node(s) had untolerated taint {nodeforlogging: }, 3 Insufficient memory
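
      Several of the candidate node groups above are also rejected with "Insufficient nvidia.com/gpu". When a GPU MachineSet is scaled down to zero, the autoscaler only knows that its template node would expose GPUs if the MachineSet advertises that capacity; in OpenShift this is normally done through the scale-from-zero annotations and the accelerator label on the MachineSet's node template. A hedged check, assuming the machine.openshift.io/GPU-count / GPU-type annotations and the cluster-api/accelerator label are the hints in play here (to be confirmed against the must-gather):

      # Scale-from-zero GPU capacity hints on the MachineSet
      $ oc get machineset ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b -n openshift-machine-api \
          -o jsonpath='{.metadata.annotations.machine\.openshift\.io/GPU-count}{" "}{.metadata.annotations.machine\.openshift\.io/GPU-type}{"\n"}'

      # Accelerator label that the ClusterAutoscaler resourceLimits.gpus type is matched against
      $ oc get machineset ocp4-7kt82-gpu-g5-4xlarge-eu-west-1b -n openshift-machine-api \
          -o jsonpath='{.spec.template.spec.metadata.labels.cluster-api/accelerator}{"\n"}'

      In this case, though, the autoscaler log above shows the GPU MachineSet itself is rejected for the volume node affinity conflict, not for missing GPU capacity.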
      

      Expected results:

      Scale-up of an OpenShift Container Platform 4 node with an available GPU should be triggered, and the pod should eventually be scheduled on the newly added node.

      Additional info:

      The must-gather and the oc adm inspect output for the namespace where the pod is hitting the issue are attached.
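
      For reference, these are typically collected with commands like the following (destination directories are placeholders; the namespace comes from the autoscaler logs above):

      $ oc adm must-gather --dest-dir=./must-gather
      $ oc adm inspect ns/aayyad-jupyter --dest-dir=./inspect-aayyad-jupyter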

      Slack Conversation: https://redhat-internal.slack.com/archives/CBZHF4DHC/p1685013880588029

      Reference Bugs: https://issues.redhat.com/browse/OCPBUGS-13541

            People

              mimccune@redhat.com Michael McCune
              rhn-support-vpavithr Vishnudutt Pavithran (Inactive)
              Zhaohua Sun Zhaohua Sun