HIVE-3068: Cluster Autoscaler scale-down stuck

    • Type: Bug
    • Resolution: Unresolved

       Hive spoke version:

      • quay.io/openshift-release-dev/ocp-release:4.21.0-x86_64 - Failed
      • registry.ci.openshift.org/ocp/release:4.22.0-0.nightly-2026-01-31-082403 - Failed
      • quay.io/openshift-release-dev/ocp-release:4.20.13-x86_64 - PASS!

      Hive spoke platform: AWS/Azure/GCP
      Steps to Reproduce:

      • Create a Hive ClusterDeployment with a MachinePool that has autoscaling (minReplicas=10, maxReplicas=12) on a multi-AZ platform (e.g. AWS with 3 AZs).
      • Wait for the cluster to install and scale to the minimum of 10 workers; the MachineSets stabilize at [4, 3, 3].
      • Deploy a workload (e.g. a busybox deployment; see the sketch after these steps) that consumes enough capacity that the autoscaler scales up to 12 workers; the MachineSets become [4, 4, 4].
      • Delete the busybox deployment. The autoscaler should scale the pool back down to the minimum of 10 workers.
      • Observed: Scale-down to 10 never occurs; worker count remains 12.
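
      A minimal sketch of the busybox load deployment referenced in the steps above, assuming the worker size in this report (m6a.large, 2 vCPU / 8 GiB). The name, namespace, replica count, and resource requests are illustrative assumptions; size them so the pending pods overflow what 10 workers can hold:

      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: busybox-load            # hypothetical name
        namespace: default            # any test namespace works
      spec:
        replicas: 24                  # assumption: enough pods to exceed 10 workers' capacity
        selector:
          matchLabels:
            app: busybox-load
        template:
          metadata:
            labels:
              app: busybox-load
          spec:
            containers:
            - name: busybox
              image: busybox:1.36     # assumption: any reachable busybox image works
              command: ["sleep", "86400"]
              resources:
                requests:
                  cpu: 500m           # roughly 3 pods per 2-vCPU worker after system reservations
                  memory: 128Mi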

      Expected Result:

      • Cluster Autoscaler should consider scale-down for all worker nodes in all three MachineSets.
        • Two of the three MachineSets have min=3 (the pool min of 10 is distributed as 4+3+3); CA should be able to scale down one node in each of those two MachineSets, reducing total workers from 12 to 10 (see the sketch below).
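
      A sketch of the MachinePool status.machineSets shape expected once scale-down completes, following the 4+3+3 min / 4+4+4 max distribution above (MachineSet names abbreviated; field names match the status dump further below):

      machineSets:
      - name: ...-worker-us-east-2a
        minReplicas: 4
        maxReplicas: 4
        replicas: 4                   # already at its per-AZ minimum, stays at 4
      - name: ...-worker-us-east-2b
        minReplicas: 3
        maxReplicas: 4
        replicas: 3                   # scaled down from 4
      - name: ...-worker-us-east-2c
        minReplicas: 3
        maxReplicas: 4
        replicas: 3                   # scaled down from 4
      # total: 10 workers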

      Actual Result: 

      • Only 4 worker nodes (all in one AZ, us-east-2a) appear in the CA logs as scale-down candidates; all 4 are skipped with "node group min size reached (current: 4, min: 4)".
      • The other 8 worker nodes (in us-east-2b and us-east-2c) never appear in the CA scale-down logs; they are never considered for scale-down.
      • Worker count stays at 12.
      ---
      apiVersion: hive.openshift.io/v1
      kind: MachinePool
      metadata:
        annotations:
          kubectl.kubernetes.io/last-applied-configuration: |
            {"apiVersion":"hive.openshift.io/v1","kind":"MachinePool","metadata":{"annotations":{},"name":"hive-d3bcee57-e4af-4774-aed5-cc2d3fb140d7-worker","namespace":"hive-e2e2"},"spec":{"clusterDeploymentRef":{"name":"hive-d3bcee57-e4af-4774-aed5-cc2d3fb140d7"},"name":"worker","platform":{"aws":{"rootVolume":{"size":120,"type":"gp3"},"type":"m6a.large","userTags":{"expirationDate":"2026-02-02T14:22+00:00"}}},"replicas":2},"status":{}}
        creationTimestamp: "2026-02-02T10:23:12Z"
        finalizers:
        - hive.openshift.io/remotemachineset
        generation: 4
        name: hive-d3bcee57-e4af-4774-aed5-cc2d3fb140d7-worker
        namespace: hive-e2e2
        resourceVersion: "151790"
        uid: 3ec9f695-20e8-4cb2-a902-c7aeaba350ca
      spec:
        autoscaling:
          maxReplicas: 12
          minReplicas: 10
        clusterDeploymentRef:
          name: hive-d3bcee57-e4af-4774-aed5-cc2d3fb140d7
        name: worker
        platform:
          aws:
            rootVolume:
              size: 120
              type: gp3
            type: m6a.large
            userTags:
              expirationDate: 2026-02-02T14:22+00:00
      status:
        conditions:
        - lastProbeTime: "2026-02-02T11:18:33Z"
          lastTransitionTime: "2026-02-02T11:18:33Z"
          message: The MachinePool has sufficient replicas for each MachineSet
          reason: EnoughReplicas
          status: "False"
          type: NotEnoughReplicas
        - lastProbeTime: "2026-02-02T10:23:12Z"
          lastTransitionTime: "2026-02-02T10:23:12Z"
          message: Condition Initialized
          reason: Initialized
          status: Unknown
          type: NoMachinePoolNameLeasesAvailable
        - lastProbeTime: "2026-02-02T11:03:39Z"
          lastTransitionTime: "2026-02-02T11:03:39Z"
          message: Subnets are valid
          reason: ValidSubnets
          status: "False"
          type: InvalidSubnets
        - lastProbeTime: "2026-02-02T11:03:39Z"
          lastTransitionTime: "2026-02-02T11:03:39Z"
          message: The configuration is supported
          reason: ConfigurationSupported
          status: "False"
          type: UnsupportedConfiguration
        - lastProbeTime: "2026-02-02T11:03:39Z"
          lastTransitionTime: "2026-02-02T11:03:39Z"
          message: MachineSets generated successfully
          reason: MachineSetGenerationSucceeded
          status: "True"
          type: MachineSetsGenerated
        - lastProbeTime: "2026-02-02T11:03:40Z"
          lastTransitionTime: "2026-02-02T11:03:40Z"
          message: Resources synced successfully
          reason: SyncSucceeded
          status: "True"
          type: Synced
        controlledByReplica: 0
        machineSets:
        - maxReplicas: 4
          minReplicas: 4
          name: hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2a
          readyReplicas: 4
          replicas: 4
        - maxReplicas: 4
          minReplicas: 3
          name: hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2b
          readyReplicas: 4
          replicas: 4
        - maxReplicas: 4
          minReplicas: 3
          name: hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2c
          readyReplicas: 4
          replicas: 4
        replicas: 12
      
       % oc get machineset -n openshift-machine-api
      NAME                                            DESIRED   CURRENT   READY   AVAILABLE   AGE
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2a   4         4         4       4           144m
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2b   4         4         4       4           144m
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2c   4         4         4       4           144m
      
       % oc get machineset -n openshift-machine-api -o yaml | grep -A2 "cluster-api-autoscaler-node-group"
            machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "4"
            machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "4"
            machine.openshift.io/memoryMb: "8192"
            machine.openshift.io/vCPU: "2"
      --
            machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "4"
            machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "3"
            machine.openshift.io/memoryMb: "8192"
            machine.openshift.io/vCPU: "2"
      --
            machine.openshift.io/cluster-api-autoscaler-node-group-max-size: "4"
            machine.openshift.io/cluster-api-autoscaler-node-group-min-size: "3"
            machine.openshift.io/memoryMb: "8192"
            machine.openshift.io/vCPU: "2"
      
      
      % oc get machines -n openshift-machine-api -l machine.openshift.io/cluster-api-machine-role=worker -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.providerID}{"\n"}{end}'
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2a-2gfdr	aws:///us-east-2a/i-06114ebe632673c1f
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2a-5ztcd	aws:///us-east-2a/i-0080473c7fcbbc494
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2a-9cdt5	aws:///us-east-2a/i-007640463716880b8
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2a-cb87m	aws:///us-east-2a/i-004d698cba6a36b4f
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2b-4lx6l	aws:///us-east-2b/i-0f4b2007191ef1d7b
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2b-gz7vf	aws:///us-east-2b/i-0215bde3818c46c3f
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2b-rb4sz	aws:///us-east-2b/i-02309f82390801f8b
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2b-zkh4h	aws:///us-east-2b/i-0b3e9b5662ef189e3
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2c-dkcz5	aws:///us-east-2c/i-054d1f0e167f992bf
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2c-sw7cl	aws:///us-east-2c/i-089aaa86a3393ffbd
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2c-vdvsc	aws:///us-east-2c/i-0c79aab1402a171d2
      hive-d3bcee57-e4af-47-zjfg5-worker-us-east-2c-x2s8m	aws:///us-east-2c/i-0c9e314be2a30522c
      
       
      % oc logs -n openshift-machine-api deployment/cluster-autoscaler-default
      ...
      I0202 12:45:54.553076       1 node_instances_cache.go:156] Start refreshing cloud provider node instances cache
      I0202 12:45:54.567432       1 node_instances_cache.go:168] Refresh cloud provider node instances cache finished, refresh took 14.32094ms
      I0202 12:46:09.597348       1 static_autoscaler.go:520] No unschedulable pods
      I0202 12:46:10.199548       1 pre_filtering_processor.go:67] Skipping ip-10-0-4-162.us-east-2.compute.internal - node group min size reached (current: 4, min: 4)
      I0202 12:46:10.199627       1 pre_filtering_processor.go:67] Skipping ip-10-0-25-85.us-east-2.compute.internal - node group min size reached (current: 4, min: 4)
      I0202 12:46:10.199717       1 pre_filtering_processor.go:67] Skipping ip-10-0-5-182.us-east-2.compute.internal - node group min size reached (current: 4, min: 4)
      I0202 12:46:10.199810       1 pre_filtering_processor.go:67] Skipping ip-10-0-11-57.us-east-2.compute.internal - node group min size reached (current: 4, min: 4)
      I0202 12:46:28.404799       1 static_autoscaler.go:520] No unschedulable pods
      I0202 12:46:29.004736       1 pre_filtering_processor.go:67] Skipping ip-10-0-4-162.us-east-2.compute.internal - node group min size reached (current: 4, min: 4)
      I0202 12:46:29.004833       1 pre_filtering_processor.go:67] Skipping ip-10-0-25-85.us-east-2.compute.internal - node group min size reached (current: 4, min: 4)
      I0202 12:46:29.004908       1 pre_filtering_processor.go:67] Skipping ip-10-0-5-182.us-east-2.compute.internal - node group min size reached (current: 4, min: 4)
      I0202 12:46:29.004985       1 pre_filtering_processor.go:67] Skipping ip-10-0-11-57.us-east-2.compute.internal - node group min size reached (current: 4, min: 4)
      I0202 12:46:47.210615       1 static_autoscaler.go:520] No unschedulable pods
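
      The attached clusterAutoscaler.yaml is not inlined above. For reference, a minimal ClusterAutoscaler resource with scale-down enabled, which this scenario depends on, looks roughly like the sketch below; the timing values are illustrative assumptions, not necessarily what the failing run used:

      apiVersion: autoscaling.openshift.io/v1
      kind: ClusterAutoscaler
      metadata:
        name: default
      spec:
        scaleDown:
          enabled: true
          delayAfterAdd: 10s          # assumption: short delays so the test turns around quickly
          delayAfterDelete: 10s
          delayAfterFailure: 30s
          unneededTime: 10s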
      

       

      cluster-autoscaler-default.log

      cluster-autoscaler-operator.log

      clusterAutoscaler.yaml

      4.20-pass.log

       

      Attachments:
        1. 0204autoscaler-operator.log (382 kB)
        2. 0204-cluster-autoscaler-default.log (7.66 MB)
        3. 4.20-pass.log (64 kB)
        4. clusterAutoscaler.yaml (0.5 kB)
        5. cluster-autoscaler-default.log (98 kB)
        6. cluster-autoscaler-operator.log (281 kB)

      Assignee: Unassigned
      Reporter: Mingxia Huang (mihuang@redhat.com)