Type: Bug
Resolution: Won't Do
Priority: Critical
Severity: Critical
Component: Incidents & Support
Description of problem:
The TotalMachineSetsReplicaSum() function in the MachineDeployment controller contains a logic flaw that causes it to double-count machines when they are stuck in a deleting state. This results in aggressive and inappropriate scale-down of healthy new MachineSets during rolling updates, potentially causing significant capacity loss in production clusters.
The function uses max(ms.Spec.Replicas, ms.Status.Replicas) to count machines per MachineSet. While this logic works for normal scale up/down scenarios, it fails catastrophically when machines are stuck deleting.
Location: internal/controllers/machinedeployment/mdutil/util.go:519
totalReplicas += max(*(ms.Spec.Replicas), ms.Status.Replicas)
Problem: When a machine is stuck deleting:
- Spec.Replicas decreases (desired state after scale-down)
- Status.Replicas remains high (actual machines still exist)
- max() always picks the higher value, effectively counting the stuck machine twice
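To make the double-count concrete, here is a minimal, self-contained sketch. It is not the actual Cluster API code: the machineSet struct, the function name maxBasedSum, and the replica numbers are placeholders chosen to mirror the scenario above.

package main

import "fmt"

// Simplified stand-in for clusterv1.MachineSet, reduced to the two
// fields that the max-based counting looks at.
type machineSet struct {
	name           string
	specReplicas   int32 // desired replicas (ms.Spec.Replicas)
	statusReplicas int32 // machines that actually exist (ms.Status.Replicas)
}

// maxBasedSum mirrors the counting described above:
// totalReplicas += max(ms.Spec.Replicas, ms.Status.Replicas)
func maxBasedSum(sets []machineSet) int32 {
	var total int32
	for _, ms := range sets {
		total += max(ms.specReplicas, ms.statusReplicas)
	}
	return total
}

func main() {
	// Rolling update of a 20-replica MachineDeployment: the old
	// MachineSet was told to scale down to 19, but one machine is
	// stuck deleting, so its status still reports 20.
	sets := []machineSet{
		{name: "old", specReplicas: 19, statusReplicas: 20},
		{name: "new", specReplicas: 1, statusReplicas: 1},
	}

	// Prints 21 for a 20-replica deployment: the stuck machine still
	// counts against the budget even though the controller already
	// planned its removal, so the surge headroom is consumed.
	fmt.Println(maxBasedSum(sets))
}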
Version-Release number of selected component (if applicable):
ROSA HCP MachinePool upgrade from 4.17.25 to 4.18.19
How reproducible:
Often, when a customer performs a ROSA HCP upgrade using the Parked Nodes strategy.
Steps to Reproduce:
- Create a MachineDeployment with 20 replicas and rolling update strategy
- Trigger a rolling update (change machine template)
- Simulate stuck machine deletion by:
- Setting restrictive PodDisruptionBudgets (a sample restrictive PDB is sketched after these steps), OR
- Introducing infrastructure provider delays, OR
- Network issues preventing proper cleanup
- Observe the old MachineSet attempting to scale down but with machines stuck in deleting state
- Monitor the new MachineSet: it is scaled down toward 1 replica, with its own machines also stuck deleting because of the PDBs
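For the restrictive-PDB variant, a PodDisruptionBudget that allows zero voluntary disruptions is usually enough to block node drain and leave machine deletion stuck. The sketch below only constructs and prints such a manifest; the name block-drain, the default namespace, and the app: workload selector are placeholders, not values taken from the affected cluster.

package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/yaml"
)

func main() {
	zero := intstr.FromInt(0)

	// maxUnavailable: 0 means the eviction API may never remove a
	// selected pod voluntarily, so draining the node (and therefore
	// deleting the Machine) gets stuck.
	pdb := policyv1.PodDisruptionBudget{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "policy/v1",
			Kind:       "PodDisruptionBudget",
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:      "block-drain",
			Namespace: "default",
		},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &zero,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "workload"},
			},
		},
	}

	out, err := yaml.Marshal(pdb)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out)) // apply the printed manifest with kubectl apply -f -
}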
Actual results:
- New MachineSet is continually scaled down (20→13→11→10→…→1 replicas observed)
- Customer workloads on the new-version MachineSet are force-evicted
- After the stuck old machines are resolved, even once all old-version machines are gone, many new machines remain in Deleting status
Expected results:
With the Parked Nodes strategy:
- New MachineSet should scale up to full capacity regardless of stuck machines in old MachineSet
- If a scale-down is needed, it should target the old MachineSet, or leave it alone
- Total capacity should remain stable during the rolling update (see the invariant sketch after this list)
- Stuck deleting old machines should not affect scaling decisions for New MachineSets
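One way to read the "total capacity should remain stable" expectation is as a checkable invariant. The helper below is hypothetical and not part of Cluster API or ROSA; availableCapacity and capacityStable are made-up names, and maxUnavailable=0 is only an assumption for a Parked-Nodes-style rollout.

package main

import "fmt"

// Hypothetical view of a machine during the rollout; not a CAPI type.
type machine struct {
	deleting bool // deletionTimestamp is set
	ready    bool
}

// availableCapacity counts machines that still provide capacity:
// ready and not marked for deletion.
func availableCapacity(machines []machine) int {
	n := 0
	for _, m := range machines {
		if m.ready && !m.deleting {
			n++
		}
	}
	return n
}

// capacityStable expresses the expected invariant: available capacity
// never drops below desired - maxUnavailable, no matter how many old
// machines are stuck deleting.
func capacityStable(machines []machine, desired, maxUnavailable int) bool {
	return availableCapacity(machines) >= desired-maxUnavailable
}

func main() {
	// 20 desired replicas, maxUnavailable=0: forcing 15 of the new
	// machines into a deleting state violates the invariant.
	machines := make([]machine, 0, 20)
	for i := 0; i < 20; i++ {
		machines = append(machines, machine{ready: true, deleting: i < 15})
	}
	fmt.Println(capacityStable(machines, 20, 0)) // false
}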
Additional info:
Timeline: Old MachineSet stuck, New MachineSet aggressively scaled down
09:15:11 - Old n8vlx: "scaling down to 19 replicas" machineCount=20 (1 stuck)
09:15:14 - New 5kf6x: "scaling down to 13 replicas" machineCount=20 (AGGRESSIVE!)
09:15:26 - Old n8vlx: "scaling down to 19 replicas" machineCount=20 (STILL stuck)
09:15:29 - New 5kf6x: "scaling down to 13 replicas" machineCount=20 (REPEATS)
09:16:16 - New 5kf6x: "scaling down to 12 replicas" machineCount=20 (WORSE)
09:16:17 - New 5kf6x: "scaling down to 11 replicas" machineCount=20 (CONTINUES)
09:16:18 - New 5kf6x: "scaling down to 10 replicas" machineCount=20 (continues)
...
Eventually 5kf6x is scaled down to 1, leaving 15 machines stuck in Deleting due to PDB blocks.
I think the bug, or at least the logic flaw, is in this call chain:
reconcileNewMachineSet()
→ NewMSNewReplicas()
→ TotalMachineSetsReplicaSum() ← BUG HERE
→ max(Spec.Replicas, Status.Replicas) ← Double-counts stuck machines
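As a rough model of how the inflated sum propagates through NewMSNewReplicas: the sketch below simplifies the real surge arithmetic, and maxSurge=1 plus the replica numbers are assumptions chosen to resemble the timeline above, not values read from the code.

package main

import "fmt"

// Simplified stand-ins; not the real clusterv1 types.
type machineSet struct {
	specReplicas   int32
	statusReplicas int32
}

// totalReplicaSum mirrors the max-based counting flagged above.
func totalReplicaSum(sets []machineSet) int32 {
	var total int32
	for _, ms := range sets {
		total += max(ms.specReplicas, ms.statusReplicas)
	}
	return total
}

// newMSScaleBudget is a rough model of the surge check: the new
// MachineSet may only grow while the perceived machine count stays
// under desired+maxSurge.
func newMSScaleBudget(desired, maxSurge int32, sets []machineSet) int32 {
	return desired + maxSurge - totalReplicaSum(sets)
}

func main() {
	desired, maxSurge := int32(20), int32(1)

	// Old MachineSet: told to scale down to 13, but 7 machines are
	// stuck deleting, so status still shows 20.
	// New MachineSet: currently at 13.
	sets := []machineSet{
		{specReplicas: 13, statusReplicas: 20},
		{specReplicas: 13, statusReplicas: 13},
	}

	// Prints -12: the controller believes it is 12 machines over the
	// surge budget, so (per this report) the new MachineSet is never
	// allowed back up and keeps getting pushed down instead.
	fmt.Println(newMSScaleBudget(desired, maxSurge, sets))
}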
is related to: OCPBUGS-60790 Cluster AutoScaler + CAPI Provider incorrectly interferes with MachineDeployment Replicas (ASSIGNED)