OpenShift Bugs / OCPBUGS-2257

ClusterAutoscaler with balanceSimilarNodeGroups does not work if scale from zero


    • Moderate
    • None
    • CLOUD Sprint 239, CLOUD Sprint 240, CLOUD Sprint 241, CLOUD Sprint 242, CLOUD Sprint 243
    • 5
    • False

      Description of problem:

      When the ClusterAutoscaler is configured with `balanceSimilarNodeGroups` set to `true` and some machinesets scale up from 0, the autoscaler first scales up only those machinesets. The other node groups are scaled up only after the scale-from-zero machinesets are full.

      Version-Release number of selected component (if applicable):

      4.12.0-0.nightly-2022-10-05-053337

      How reproducible:

      always

      Steps to Reproduce:

      1. Create a ClusterAutoscaler on GCP:
      apiVersion: "autoscaling.openshift.io/v1"
      kind: "ClusterAutoscaler"
      metadata:
        name: "default"
      spec:
        balanceSimilarNodeGroups: true
        balancingIgnoredLabels: ["topology.gke.io/zone"]
        resourceLimits:
          maxNodesTotal: 20
        scaleDown:
          enabled: true
          delayAfterAdd: 10s
          delayAfterDelete: 10s
          delayAfterFailure: 10s
          unneededTime: 10s
      2. Create MachineAutoscalers so that some machinesets need to scale from 0:
      $ oc get machineset                                                                                                      
      NAME                        DESIRED   CURRENT   READY   AVAILABLE   AGE
      zhsungcp10-lmfbm-worker-a   1         1         1       1           128m
      zhsungcp10-lmfbm-worker-b   0         0                             128m
      zhsungcp10-lmfbm-worker-c   1         1         1       1           128m
      zhsungcp10-lmfbm-worker-f   0         0                             128m
      
      $ oc get machineautoscaler                                                                                                 
      NAME                  REF KIND     REF NAME                    MIN   MAX   AGE
      machineautoscaler-a   MachineSet   zhsungcp10-lmfbm-worker-a   1     20    39m
      machineautoscaler-b   MachineSet   zhsungcp10-lmfbm-worker-b   0     19    39m
      machineautoscaler-c   MachineSet   zhsungcp10-lmfbm-worker-c   1     20    13s
      machineautoscaler-f   MachineSet   zhsungcp10-lmfbm-worker-f   0     19    39m
      3. Create a workload.
      4. Check the machinesets and the autoscaler log.

      Actual results:

      When some machinesets scale up from 0, the autoscaler balances the scale-up only among those machinesets. The other node groups are scaled up only after the scale-from-zero machinesets are full.

      I1010 08:39:48.865639       1 scale_up.go:481] Estimated 26 nodes needed in MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b
      I1010 08:39:48.865645       1 scale_up.go:486] Capping size to max cluster total size (30)
      I1010 08:39:49.605437       1 scale_up.go:591] Splitting scale-up between 2 similar node groups: {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b, MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f}
      I1010 08:39:49.605472       1 scale_up.go:601] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b 0->13 (max: 19)} {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f 0->12 (max: 19)}]
      I1010 08:39:49.605492       1 scale_up.go:700] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b size to 13
      I1010 08:39:50.209449       1 scale_up.go:700] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f size to 12
      
      $ oc get machineset                                                                                        
      NAME                        DESIRED   CURRENT   READY   AVAILABLE   AGE
      zhsungcp10-lmfbm-worker-a   1         1         1       1           130m
      zhsungcp10-lmfbm-worker-b   13        13                            130m
      zhsungcp10-lmfbm-worker-c   1         1         1       1           130m
      zhsungcp10-lmfbm-worker-f   12        12                            130m
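The check in step 4 can be done mechanically by parsing the "Final scale-up plan" log line and listing the autoscaled machinesets that the plan leaves out. The following is a minimal sketch, assuming the log line has the shape quoted in this report; the regular expression is inferred from those lines rather than from a documented log format, and the names `parse_scale_up_plan` and `groups_missing_from_plan` are hypothetical.

```python
import re

# Pattern inferred from the log lines in this report (an assumption,
# not a documented format): {MachineSet/<namespace>/<name> <old>-><new> (max: <max>)}
PLAN_ENTRY = re.compile(
    r"\{MachineSet/(?P<ns>[^/\s]+)/(?P<name>\S+) "
    r"(?P<old>\d+)->(?P<new>\d+) \(max: (?P<max>\d+)\)\}"
)


def parse_scale_up_plan(line):
    """Map machineset name -> (old size, new size, max size) for one plan line."""
    return {
        m["name"]: (int(m["old"]), int(m["new"]), int(m["max"]))
        for m in PLAN_ENTRY.finditer(line)
    }


def groups_missing_from_plan(line, autoscaled_groups):
    """Return the autoscaled machinesets that the plan does not scale, sorted."""
    return sorted(set(autoscaled_groups) - set(parse_scale_up_plan(line)))


# Log line and machineset names copied from this report.
LOG_LINE = (
    "I1010 08:39:49.605472       1 scale_up.go:601] Final scale-up plan: "
    "[{MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b 0->13 (max: 19)} "
    "{MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f 0->12 (max: 19)}]"
)
AUTOSCALED = [
    "zhsungcp10-lmfbm-worker-a",
    "zhsungcp10-lmfbm-worker-b",
    "zhsungcp10-lmfbm-worker-c",
    "zhsungcp10-lmfbm-worker-f",
]

if __name__ == "__main__":
    print(parse_scale_up_plan(LOG_LINE))
    print(groups_missing_from_plan(LOG_LINE, AUTOSCALED))
```

For the log line above, the machinesets reported as missing are worker-a and worker-c, which matches the `oc get machineset` output.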

      Expected results:

      The scale-up is balanced across all similar node groups, including those that already have nodes.
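To make the expected outcome concrete, the sketch below spreads the 25 new nodes from the log (13 + 12) across all four machinesets instead of only the two that start at 0. It is an illustration only: `balance_scale_up` is a hypothetical helper that hands out nodes one at a time to the smallest group with headroom. It is not the cluster autoscaler's actual balancing code, and it assumes all four machinesets count as similar node groups.

```python
def balance_scale_up(current, max_size, new_nodes):
    """Spread new_nodes over the groups so that their sizes end up as equal as possible.

    Hypothetical illustration: each new node goes to the smallest group
    that is still below its maximum. Ties go to the group listed first.
    """
    target = dict(current)
    for _ in range(new_nodes):
        candidates = [g for g in target if target[g] < max_size[g]]
        if not candidates:
            break  # every group is at its maximum
        smallest = min(candidates, key=lambda g: target[g])
        target[smallest] += 1
    return target


# Sizes and limits taken from the reproduction steps in this report.
CURRENT = {"worker-a": 1, "worker-b": 0, "worker-c": 1, "worker-f": 0}
MAX_SIZE = {"worker-a": 20, "worker-b": 19, "worker-c": 20, "worker-f": 19}

if __name__ == "__main__":
    print(balance_scale_up(CURRENT, MAX_SIZE, 25))
```

Under these assumptions the four machinesets end at 7, 7, 7, and 6 nodes, compared with the observed 1, 13, 1, and 12.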

      Additional info:

      Other testing on GCP:
      $ oc get machineautoscaler                                                                                                                                                     
      NAME                  REF KIND     REF NAME                    MIN   MAX   AGE
      machineautoscaler-a   MachineSet   zhsungcp10-lmfbm-worker-a   1     10    3m41s
      machineautoscaler-b   MachineSet   zhsungcp10-lmfbm-worker-b   1     10    3m55s
      machineautoscaler-f   MachineSet   zhsungcp10-lmfbm-worker-f   0     10    4m15s
      
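      After adding a workload (worker-f, the only machineset starting at 0, is scaled to its maximum on its own, and the split covers only worker-a and worker-b):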
      Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f size to 10
      Splitting scale-up between 2 similar node groups: {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b, MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-a}
      I1010 07:46:11.862566       1 scale_up.go:601] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b 1->3 (max: 10)} {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-a 1->3 (max: 10)}]
      
      --------
      $ oc get machineautoscaler                                                                                          
      NAME                  REF KIND     REF NAME                    MIN   MAX   AGE
      machineautoscaler-a   MachineSet   zhsungcp10-lmfbm-worker-a   1     10    22m
      machineautoscaler-b   MachineSet   zhsungcp10-lmfbm-worker-b   0     9     22m
      machineautoscaler-f   MachineSet   zhsungcp10-lmfbm-worker-f   0     9     22m
      
      After adding a workload (only the two machinesets starting at 0 are balanced, and worker-a stays at 1):
      Capping size to max cluster total size (20)
      I1010 08:32:19.557095       1 scale_up.go:591] Splitting scale-up between 2 similar node groups: {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b, MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f}
      I1010 08:32:19.557132       1 scale_up.go:601] Final scale-up plan: [{MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b 0->8 (max: 9)} {MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f 0->7 (max: 9)}]
      I1010 08:32:19.557149       1 scale_up.go:700] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-b size to 8
      I1010 08:32:20.161422       1 scale_up.go:700] Scale-up: setting group MachineSet/openshift-machine-api/zhsungcp10-lmfbm-worker-f size to 7
      
      $ oc get machineset                                                                   
      NAME                        DESIRED   CURRENT   READY   AVAILABLE   AGE
      zhsungcp10-lmfbm-worker-a   1         1         1       1           123m
      zhsungcp10-lmfbm-worker-b   8         8                             123m
      zhsungcp10-lmfbm-worker-c   1         1         1       1           123m
      zhsungcp10-lmfbm-worker-f   7         7                             123m

              mimccune@redhat.com Michael McCune
              rhn-support-zhsun Zhaohua Sun
              Zhaohua Sun Zhaohua Sun