Type: Bug
Resolution: Won't Do
Priority: Critical
Severity: Critical
Component: Incidents & Support
Description of problem:
The TotalMachineSetsReplicaSum() function in the MachineDeployment controller contains a logic flaw that causes it to double-count machines when they are stuck in a deleting state. This results in aggressive and inappropriate scale-down of healthy new MachineSets during rolling updates, potentially causing significant capacity loss in production clusters.
The function uses max(ms.Spec.Replicas, ms.Status.Replicas) to count machines per MachineSet. While this logic works for normal scale up/down scenarios, it fails catastrophically when machines are stuck deleting.
Location: internal/controllers/machinedeployment/mdutil/util.go:519
totalReplicas += max(*(ms.Spec.Replicas), ms.Status.Replicas)
Problem: When a machine is stuck deleting:
- Spec.Replicas decreases (desired state after scale-down)
- Status.Replicas remains high (actual machines still exist)
- max() always picks the higher value, effectively counting the stuck machine twice
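To make the double-count concrete, here is a minimal, self-contained sketch. It is not the actual Cluster API code: the machineSet struct, the function name maxBasedSum, and the replica numbers are placeholders chosen to mirror the scenario above.

package main

import "fmt"

// Simplified stand-in for clusterv1.MachineSet, reduced to the two
// fields that the max-based counting looks at.
type machineSet struct {
	name           string
	specReplicas   int32 // desired replicas (ms.Spec.Replicas)
	statusReplicas int32 // machines that actually exist (ms.Status.Replicas)
}

// maxBasedSum mirrors the counting described above:
// totalReplicas += max(ms.Spec.Replicas, ms.Status.Replicas)
func maxBasedSum(sets []machineSet) int32 {
	var total int32
	for _, ms := range sets {
		total += max(ms.specReplicas, ms.statusReplicas)
	}
	return total
}

func main() {
	// Rolling update of a 20-replica MachineDeployment: the old
	// MachineSet was told to scale down to 19, but one machine is
	// stuck deleting, so its status still reports 20.
	sets := []machineSet{
		{name: "old", specReplicas: 19, statusReplicas: 20},
		{name: "new", specReplicas: 1, statusReplicas: 1},
	}

	// Prints 21 for a 20-replica deployment: the stuck machine still
	// counts against the budget even though the controller already
	// planned its removal, so the surge headroom is consumed.
	fmt.Println(maxBasedSum(sets))
}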
Version-Release number of selected component (if applicable):
ROSA HCP MachinePool upgrade from 4.17.25 to 4.18.19
How reproducible:
Often, when a customer performs a ROSA HCP upgrade using the Parked Nodes strategy.
Steps to Reproduce:
- Create a MachineDeployment with 20 replicas and rolling update strategy
- Trigger a rolling update (change machine template)
- Simulate stuck machine deletion by:
- Setting restrictive PodDisruptionBudgets (a sample restrictive PDB is sketched after these steps), OR
- Introducing infrastructure provider delays, OR
- Network issues preventing proper cleanup
- Observe the old MachineSet attempting to scale down but with machines stuck in deleting state
- Monitor the new MachineSet: it is scaled down toward 1 replica, with its own machines also stuck deleting because of the PDBs
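For the restrictive-PDB variant, a PodDisruptionBudget that allows zero voluntary disruptions is usually enough to block node drain and leave machine deletion stuck. The sketch below only constructs and prints such a manifest; the name block-drain, the default namespace, and the app: workload selector are placeholders, not values taken from the affected cluster.

package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/yaml"
)

func main() {
	zero := intstr.FromInt(0)

	// maxUnavailable: 0 means the eviction API may never remove a
	// selected pod voluntarily, so draining the node (and therefore
	// deleting the Machine) gets stuck.
	pdb := policyv1.PodDisruptionBudget{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "policy/v1",
			Kind:       "PodDisruptionBudget",
		},
		ObjectMeta: metav1.ObjectMeta{
			Name:      "block-drain",
			Namespace: "default",
		},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MaxUnavailable: &zero,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "workload"},
			},
		},
	}

	out, err := yaml.Marshal(pdb)
	if err != nil {
		panic(err)
	}
	fmt.Print(string(out)) // apply the printed manifest with kubectl apply -f -
}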
Actual results:
- New MachineSet is continually scaled down (20→13→11→10→…→1 replicas observed)
- Customer workloads on the new-version MachineSet are force-evicted
- After the stuck old machines are resolved, even once all old-version machines are gone, many new machines remain in Deleting status
Expected results:
With the Parked Nodes strategy:
- New MachineSet should scale up to full capacity regardless of stuck machines in old MachineSet
- If a scale-down is needed, it should target the old MachineSet, or leave it alone
- Total capacity should remain stable during the rolling update (see the invariant sketch after this list)
- Stuck deleting old machines should not affect scaling decisions for New MachineSets
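One way to read the "total capacity should remain stable" expectation is as a checkable invariant. The helper below is hypothetical and not part of Cluster API or ROSA; availableCapacity and capacityStable are made-up names, and maxUnavailable=0 is only an assumption for a Parked-Nodes-style rollout.

package main

import "fmt"

// Hypothetical view of a machine during the rollout; not a CAPI type.
type machine struct {
	deleting bool // deletionTimestamp is set
	ready    bool
}

// availableCapacity counts machines that still provide capacity:
// ready and not marked for deletion.
func availableCapacity(machines []machine) int {
	n := 0
	for _, m := range machines {
		if m.ready && !m.deleting {
			n++
		}
	}
	return n
}

// capacityStable expresses the expected invariant: available capacity
// never drops below desired - maxUnavailable, no matter how many old
// machines are stuck deleting.
func capacityStable(machines []machine, desired, maxUnavailable int) bool {
	return availableCapacity(machines) >= desired-maxUnavailable
}

func main() {
	// 20 desired replicas, maxUnavailable=0: forcing 15 of the new
	// machines into a deleting state violates the invariant.
	machines := make([]machine, 0, 20)
	for i := 0; i < 20; i++ {
		machines = append(machines, machine{ready: true, deleting: i < 15})
	}
	fmt.Println(capacityStable(machines, 20, 0)) // false
}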
Additional info:
Timeline: Old MachineSet stuck, New MachineSet aggressively scaled down
09:15:11 - Old n8vlx: "scaling down to 19 replicas" machineCount=20 (1 stuck)
09:15:14 - New 5kf6x: "scaling down to 13 replicas" machineCount=20 (AGGRESSIVE!)
09:15:26 - Old n8vlx: "scaling down to 19 replicas" machineCount=20 (STILL stuck)
09:15:29 - New 5kf6x: "scaling down to 13 replicas" machineCount=20 (REPEATS)
09:16:16 - New 5kf6x: "scaling down to 12 replicas" machineCount=20 (WORSE)
09:16:17 - New 5kf6x: "scaling down to 11 replicas" machineCount=20 (CONTINUES)
09:16:18 - New 5kf6x: "scaling down to 10 replicas" machineCount=20 (continues)
...
Eventually 5kf6x is scaled down to 1, leaving 15 machines stuck in Deleting due to PDB blocks.
I think the bug, or at least the logic flaw, is in this call chain:
reconcileNewMachineSet()
→ NewMSNewReplicas()
→ TotalMachineSetsReplicaSum() ← BUG HERE
→ max(Spec.Replicas, Status.Replicas) ← Double-counts stuck machines
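As a rough model of how the inflated sum propagates through NewMSNewReplicas: the sketch below simplifies the real surge arithmetic, and maxSurge=1 plus the replica numbers are assumptions chosen to resemble the timeline above, not values read from the code.

package main

import "fmt"

// Simplified stand-ins; not the real clusterv1 types.
type machineSet struct {
	specReplicas   int32
	statusReplicas int32
}

// totalReplicaSum mirrors the max-based counting flagged above.
func totalReplicaSum(sets []machineSet) int32 {
	var total int32
	for _, ms := range sets {
		total += max(ms.specReplicas, ms.statusReplicas)
	}
	return total
}

// newMSScaleBudget is a rough model of the surge check: the new
// MachineSet may only grow while the perceived machine count stays
// under desired+maxSurge.
func newMSScaleBudget(desired, maxSurge int32, sets []machineSet) int32 {
	return desired + maxSurge - totalReplicaSum(sets)
}

func main() {
	desired, maxSurge := int32(20), int32(1)

	// Old MachineSet: told to scale down to 13, but 7 machines are
	// stuck deleting, so status still shows 20.
	// New MachineSet: currently at 13.
	sets := []machineSet{
		{specReplicas: 13, statusReplicas: 20},
		{specReplicas: 13, statusReplicas: 13},
	}

	// Prints -12: the controller believes it is 12 machines over the
	// surge budget, so (per this report) the new MachineSet is never
	// allowed back up and keeps getting pushed down instead.
	fmt.Println(newMSScaleBudget(desired, maxSurge, sets))
}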
is related to: OCPBUGS-60790 Cluster AutoScaler + CAPI Provider incorrectly interferes with MachineDeployment Replicas (ASSIGNED)