OpenShift Bugs / OCPBUGS-31760

Cluster Autoscaler Not Scaling


      Description of problem:

      The cluster autoscaler is not scaling up the machineset, even though multiple pods are pending and cannot be scheduled.
          

      Version-Release number of selected component (if applicable):

      This specific instance was observed on 4.15.3.
          

      How reproducible:

      
          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      There is one node in the `build` machineset, which is configured to autoscale from 0 to 120 nodes. That node is using 80% of its CPU and new workloads cannot be scheduled on it. However, the autoscaler reports that there is no need to scale up, even though there are pending pods.
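      For context, a MachineAutoscaler expressing the 0-120 range described above would look roughly like the sketch below; the resource names are placeholders, not the actual objects from this cluster:

      apiVersion: autoscaling.openshift.io/v1beta1
      kind: MachineAutoscaler
      metadata:
        name: build                        # placeholder name
        namespace: openshift-machine-api
      spec:
        minReplicas: 0                     # scale-from-zero is enabled for this machineset
        maxReplicas: 120
        scaleTargetRef:
          apiVersion: machine.openshift.io/v1beta1
          kind: MachineSet
          name: build                      # placeholder machineset name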
      
      Specific log messages from the failed CI pipelines that seemingly contradict each other:
      
      * 2024-04-04T10:44:56Z 7x default-scheduler: 0/35 nodes are available: 1 Insufficient cpu, 16 node(s) had untolerated taint {node-role.kubernetes.io/tests: tests-worker}, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/longtests-worker: longtests-worker}, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 4 node(s) didn't match Pod's node affinity/selector, 6 node(s) had untolerated taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}. preemption: 0/35 nodes are available: 1 No preemption victims found for incoming pod, 34 Preemption is not helpful for scheduling..
      * 2024-04-04T10:59:59Z 6x cluster-autoscaler: pod didn't trigger scale-up: 1 node(s) had untolerated taint {node-role.kubernetes.io/longtests-worker: longtests-worker}, 1 node(s) had untolerated taint {node-role.kubernetes.io/prowjobs-worker: prowjobs-worker}, 1 node(s) had untolerated taint {node-role.kubernetes.io/tests-worker: tests-worker}, 1 node(s) didn't match Pod's node affinity/selector, 1 not ready for scale-up
      
      Specifically what stands out to me:
      
      default-scheduler: 0/35 nodes are available: 1 Insufficient cpu
      
      cluster-autoscaler: pod didn't trigger scale-up: 1 not ready for scale-up
          

      Expected results:

      Because CPU on the existing node is insufficient to schedule the additional workloads, the autoscaler should scale up the machineset.
          

      Additional info:

      As a workaround, after taking a must-gather, I had the team raise the minimum number of nodes on the machineset.
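      Concretely, the workaround amounts to raising spec.minReplicas on the MachineAutoscaler sketched above, so a floor of nodes stays available without relying on scale-from-zero; the value below is illustrative:

      spec:
        minReplicas: 3     # raised from 0 as a temporary workaround (illustrative value)
        maxReplicas: 120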
          
