-
Bug
-
Resolution: Unresolved
-
Critical
-
4.18
-
Quality / Stability / Reliability
-
False
-
-
3
-
Critical
-
AUTOSCALE - Sprint 276, AUTOSCALE - Sprint 277, AUTOSCALE - Sprint 278
-
3
Description of problem:
- The Cluster Autoscaler CAPI provider incorrectly calculates safety constraints during MachineDeployment rolling updates, leading to massive over-deletion of machines.
Version-Release number of selected component (if applicable):
4.18 (but we believe this affects all versions)
How reproducible:
Often
Steps to Reproduce:
- Create a MachineDeployment with minSize=1 and maxSize=n; assume the current spec.replicas=20
- Set maxUnavailable=0 for the rolling update
- Trigger a rolling update (e.g., change the machine template)
- Use a PDB or K8s-Shredder to prevent the old machines in the old MachineSet from being deleted
- For example, 20 new machines are created, giving 40 machines in total
- The workloads migrate from the old machines to the new-version machines, leaving the old machines empty
- Observe that the Cluster Autoscaler tries to remove 19 of the empty nodes while the rolling upgrade is stuck
- Result: the MachineDeployment scales down to 1 machine, deleting 39 machines in total
Actual results:
- Cluster Autoscaler calls SetSize(20-19) → reduces spec.replicas to 1
- MachineDeployment controller sees spec.replicas=1 vs. 40 actual machines
- Because the old machines in the old MachineSet are not being deleted, all new machines in the new MachineSet are deleted except 1
- MachineDeployment controller actually deletes 39 machines (not just the 19 empty ones)
- MachineDeployment scales from 40 machines to 1 machine
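
The arithmetic behind these numbers can be sketched as a small model. This is only an illustration of the figures reported above, not the autoscaler's actual code (which is written in Go); the names `RolloutState` and `scale_down` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass
class RolloutState:
    """Hypothetical snapshot of a MachineDeployment in the middle of a rolling update."""
    spec_replicas: int  # MachineDeployment spec.replicas
    old_machines: int   # machines still owned by the old MachineSet
    new_machines: int   # machines owned by the new MachineSet

    @property
    def actual_machines(self) -> int:
        return self.old_machines + self.new_machines


def scale_down(state: RolloutState, nodes_to_remove: int) -> tuple[int, int]:
    """Model SetSize(spec.replicas - nodes_to_remove).

    Returns (new spec.replicas, machines the controller must delete to
    converge on that value). The second number is computed against the
    actual machine count, which is what makes it larger than the request.
    """
    new_spec = state.spec_replicas - nodes_to_remove
    to_delete = state.actual_machines - new_spec
    return new_spec, to_delete


# Scenario from this report: spec.replicas=20, 20 old + 20 new machines,
# and the autoscaler asks to remove 19 empty nodes.
state = RolloutState(spec_replicas=20, old_machines=20, new_machines=20)
new_spec, deleted = scale_down(state, nodes_to_remove=19)
print(new_spec, deleted)  # 1 39
```

The request was to remove 19 nodes, but converging on spec.replicas=1 from 40 machines costs 39 deletions, 20 more than intended.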
Expected results:
Safety calculations should consider the actual running machine count during rolling updates. If the CAS finds that the running empty nodes belong to the old machines, it should treat them as pending deletion and take no scale-down action that reduces the MachineDeployment replicas.
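
One possible shape for such a check is sketched below, purely as an illustration of the expected behavior. The function `permitted_scale_down` and its parameters are hypothetical and do not correspond to any existing Cluster Autoscaler API; the sketch assumes the caller can tell which empty nodes belong to the old MachineSet.

```python
def permitted_scale_down(spec_replicas: int, min_size: int,
                         empty_old: int, empty_new: int) -> int:
    """Return how many replicas a scale-down may remove during a rollout.

    Assumption for this sketch: empty nodes backed by the old MachineSet
    are treated as pending deletion by the rollout itself, so they never
    count as scale-down candidates. `empty_old` is accepted only to make
    that exclusion explicit.
    """
    candidates = empty_new                        # old-MachineSet nodes are ignored
    headroom = max(spec_replicas - min_size, 0)   # never go below minSize
    return min(candidates, headroom)


# Scenario from this report: all 19 empty nodes are on old machines,
# so no reduction of spec.replicas is permitted.
print(permitted_scale_down(spec_replicas=20, min_size=1, empty_old=19, empty_new=0))
```

With this rule the rolling update remains responsible for removing the old machines, and spec.replicas stays at 20 in the scenario described above.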
Additional info:
- relates to ACM-23449 TotalMachineSetsReplicaSum() double-counts machines during rolling updates, causing continually scale-down of New MachineSets (Closed)