Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 4.21.0
Affects Version/s: 4.18
Component/s: Cluster Autoscaler
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
3
Severity:
Critical
Regression:
None

Target Backport Versions:

4.18, 4.19, 4.20
Target Version:

4.21.0
Release Blocker:
None
Sprint:
AUTOSCALE - Sprint 276, AUTOSCALE - Sprint 277, AUTOSCALE - Sprint 278, AUTOSCALE - Sprint 279
sprint_count:
4

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
Done
Release Note Type:
Bug Fix
Release Note Text:

Before this update, when a `MachineDeployment` was in the process of upgrading its machines and the Cluster Autoscaler was also scaling the `MachineDeployment`, the Cluster Autoscaler could remove new machines by scaling down the `MachineDeployment` for under-utilized nodes. With this release, scale down does not occur when a `MachineDeployment` is in the process of upgrading its machines. (link:https://issues.redhat.com/browse/OCPBUGS-60790[OCPBUGS-60790])


Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

The Cluster Autoscaler CAPI provider incorrectly calculates safety constraints during MachineDeployment rolling updates, leading to massive over-deletion of machines.

Version-Release number of selected component (if applicable):

    4.18 (but we believe this affects all versions)

How reproducible:

  Often

Steps to Reproduce:

Create a MachineDeployment with minSize=1 and maxSize=n; assume the current spec.replicas=20.
Set maxUnavailable=0 for the rolling update.
Trigger a rolling update (e.g., change the machine template).
Use a PDB or K8s-Shredder to prevent the old machines in the old MachineSet from being deleted.
For example, 20 new machines are created, for a current total of 40 machines.
The workloads migrate from the old machines to the new-version machines (leaving the old machines empty).
Observe that the Cluster Autoscaler expects to remove the 19 empty nodes that are stuck in the rolling upgrade.
Result: the MachineDeployment scales down to 1 machine, deleting 39 machines in total.

Actual results:

The Cluster Autoscaler calls SetSize(20-19), which reduces spec.replicas to 1.
The MachineDeployment controller sees spec.replicas=1 against 40 actual machines.
Because the old machines in the old MachineSet are not deleted, all of the new machines in the new MachineSet are deleted except 1.
The MachineDeployment controller actually deletes 39 machines (not just the 19 empty ones).
The MachineDeployment scales from 40 machines to 1 machine.
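
The arithmetic behind the actual results can be illustrated with a minimal Go sketch. The rolloutState type and both helper functions are hypothetical names made up for this illustration; they are not taken from the Cluster Autoscaler or Cluster API source. The sketch only assumes what the report states: the new size is derived from spec.replicas (20) rather than from the 40 machines that actually exist.

```go
package main

import "fmt"

// rolloutState is a hypothetical snapshot of the scenario in this report.
// It is not a type from the Cluster Autoscaler or Cluster API.
type rolloutState struct {
	SpecReplicas   int // MachineDeployment spec.replicas
	ActualMachines int // machines across the old and new MachineSets
	EmptyNodes     int // nodes the autoscaler wants to remove
	MinSize        int // autoscaler lower bound for the node group
}

// naiveTarget mirrors the arithmetic in the report: the new size is computed
// from spec.replicas and ignores the surplus machines created by the rollout.
func naiveTarget(s rolloutState) int {
	target := s.SpecReplicas - s.EmptyNodes
	if target < s.MinSize {
		target = s.MinSize
	}
	return target
}

// machinesRemoved is how many machines must be deleted to reach the target.
func machinesRemoved(s rolloutState, target int) int {
	return s.ActualMachines - target
}

func main() {
	s := rolloutState{SpecReplicas: 20, ActualMachines: 40, EmptyNodes: 19, MinSize: 1}
	target := naiveTarget(s)
	fmt.Printf("target replicas: %d\n", target)
	fmt.Printf("machines removed: %d (autoscaler intended %d)\n",
		machinesRemoved(s, target), s.EmptyNodes)
}
```

With the numbers from the steps to reproduce, the target is 1 and reaching it requires removing 39 machines, 20 more than the 19 empty nodes the autoscaler meant to remove.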

Expected results:

Safety calculations should consider the actual running machine count during rolling updates.

If the CAS finds that the running empty nodes belong to the old machines, it should consider them pending deletion and take no scale-down action that reduces the MachineDeployment replicas.
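
One way to picture the expected behavior is a guard that refuses to scale down while a rollout is in progress. The Go sketch below is illustrative only and is not the fix that was merged. The deploymentStatus type, isRollingOut, and scaleDownBudget are hypothetical names, and the rollout check (comparing total, updated, and desired replica counts) is an assumption about how an in-progress upgrade might be detected.

```go
package main

import "fmt"

// deploymentStatus is a hypothetical, simplified view of a MachineDeployment.
// The fields loosely correspond to spec.replicas, status.replicas and
// status.updatedReplicas, but this is not the real Cluster API type.
type deploymentStatus struct {
	SpecReplicas    int // desired machine count
	StatusReplicas  int // machines that currently exist (old and new)
	UpdatedReplicas int // machines already on the new template
}

// isRollingOut is an assumed heuristic: a rollout is in progress when not all
// desired machines are updated, or when old machines still exist alongside
// the updated ones.
func isRollingOut(d deploymentStatus) bool {
	return d.UpdatedReplicas < d.SpecReplicas || d.StatusReplicas != d.UpdatedReplicas
}

// scaleDownBudget returns how many nodes may be removed. It returns 0 during
// a rollout, and otherwise caps the removal so the size stays at or above minSize.
func scaleDownBudget(d deploymentStatus, emptyNodes, minSize int) int {
	if isRollingOut(d) {
		return 0
	}
	budget := d.SpecReplicas - minSize
	if budget < 0 {
		budget = 0
	}
	if emptyNodes < budget {
		return emptyNodes
	}
	return budget
}

func main() {
	rolling := deploymentStatus{SpecReplicas: 20, StatusReplicas: 40, UpdatedReplicas: 20}
	steady := deploymentStatus{SpecReplicas: 20, StatusReplicas: 20, UpdatedReplicas: 20}
	fmt.Println("during rollout:", scaleDownBudget(rolling, 19, 1))
	fmt.Println("steady state:", scaleDownBudget(steady, 19, 1))
}
```

In the reported scenario (20 desired, 40 existing, 20 updated) the guard yields a budget of 0, so spec.replicas is left untouched until the old machines are gone.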

Additional info:

is cloned by

OCPBUGS-63495 Cluster AutoScaler + CAPI Provider incorrectly interferes with MachineDeployment Replicas

Closed

is depended on by

OCPBUGS-63495 Cluster AutoScaler + CAPI Provider incorrectly interferes with MachineDeployment Replicas

Closed

relates to

ACM-23449 TotalMachineSetsReplicaSum() double-counts machines during rolling updates, causing continually scale-down of New MachineSets

Closed

links to

CA ClusterAPI provider can delete wrong node when scale-down occurs during MachineDeployment upgrade

CAS long lived upgrading nodes problem (slides)

openshift/kubernetes-autoscaler#380: OCPBUGS-60790: refactor cloud provider options


Assignee: Michael McCune

Reporter: Jude Zhu

QA Contact: Paul Rozehnal

Need Info From: None

Votes: 0

Watchers: 10

Created: 2025/08/22 1:46 AM

Updated: 2026/02/10 9:46 AM

Resolved: 2026/02/10 9:46 AM
