Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: 4.10.z, 4.15.z
Component/s: Cloud Compute / Cluster Autoscaler
Labels:
- ServiceDeliveryImpact

Severity:
Moderate
Regression:
No
Story Points:
3
Sprint:
CLOUD Sprint 246, CLOUD Sprint 247, CLOUD Sprint 248, CLOUD Sprint 249, CLOUD Sprint 250, CLOUD Sprint 251, CLOUD Sprint 252, CLOUD Sprint 253, CLOUD Sprint 254, CLOUD Sprint 255, CLOUD Sprint 256, CLOUD Sprint 257, CLOUD Sprint 258, CLOUD Sprint 259, CLOUD Sprint 260, CLOUD Sprint 261, CLOUD Sprint 263, CLOUD Sprint 264, CLOUD Sprint 262, CLOUD Sprint 265, CLOUD Sprint 266, AUTOSCALE - Sprint 268, AUTOSCALE - Sprint 267
sprint_count:
23
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:
None
Customer Impact:

Customer Escalated, Customer Facing, Customer Reported

Description of problem:

On March 22, 2023, the autoscaler entered a broken state with the following error:

2023-03-22T12:46:49.148733289Z E0322 12:46:49.148726       1 static_autoscaler.go:364] Failed to fix node group sizes: failed to decrease MachineSet/openshift-machine-api/eu-3-compute-kgzn2-aro-machineset-compute-xl-germanywestcentral-1: attempt to delete existing nodes targetSize:4 delta:-1 existingNodes: 6
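The numbers in this message can be read as a size check: lowering the target size from 4 by 1 would leave it at 3, which is below the 6 nodes that already exist. A minimal sketch of such a check follows, assuming the comparison is `targetSize + delta < existingNodes`. The function name `decreaseTargetSize` and its signature are hypothetical and are not taken from the autoscaler source; only the message format comes from the log line above.

```go
package main

import "fmt"

// decreaseTargetSize is a hypothetical, simplified stand-in for the node
// group's size check. delta must be negative. It refuses any decrease that
// would push the target size below the number of nodes that already exist.
func decreaseTargetSize(targetSize, delta, existingNodes int) (int, error) {
	if delta >= 0 {
		return targetSize, fmt.Errorf("size decrease must be negative")
	}
	if targetSize+delta < existingNodes {
		return targetSize, fmt.Errorf(
			"attempt to delete existing nodes targetSize:%d delta:%d existingNodes: %d",
			targetSize, delta, existingNodes)
	}
	return targetSize + delta, nil
}

func main() {
	// Values taken from the log line in this report.
	_, err := decreaseTargetSize(4, -1, 6)
	fmt.Println(err)
}
```

With the values from the report (4, -1, 6) the sketch returns the same message text that appears in the log.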

According to the code here:

https://github.com/openshift/kubernetes-autoscaler/blob/7aea306f3cd9951007d5c1b981bf3da770b52790/cluster-autoscaler/core/static_autoscaler.go#L421

once the autoscaler takes the failure branch of that if statement, it does not go on to check for unschedulable pods in this if/else block:

https://github.com/openshift/kubernetes-autoscaler/blob/7aea306f3cd9951007d5c1b981bf3da770b52790/cluster-autoscaler/core/static_autoscaler.go#L509

This continues until the node group size is fixed.

This can be seen in the logs below: the autoscaler stopped checking for unschedulable pods at 2023-03-22T12:46 and resumed only on March 23 at 19:14 UTC.

TL;DR: once the quota issue (the specific error the customer was hitting) was resolved, between 19:05 (the last time a quota error was seen) and 19:14, the autoscaler recovered and resumed working.

I0322 12:45:08.041970       1 static_autoscaler.go:419] No unschedulable pods
I0322 12:45:33.267542       1 static_autoscaler.go:419] No unschedulable pods
I0322 12:45:58.492504       1 static_autoscaler.go:419] No unschedulable pods
I0322 12:46:23.716449       1 static_autoscaler.go:419] No unschedulable pods << last entry on March 22nd
I0323 19:14:00.640484       1 klogx.go:86] Pod eu-3-compute/compute-customer-27-7b6b98989-vrl7g is unschedulable
I0323 19:14:31.471670       1 static_autoscaler.go:419] No unschedulable pods
I0323 19:14:56.698237       1 static_autoscaler.go:419] No unschedulable pods
I0323 19:15:21.925996       1 static_autoscaler.go:419] No unschedulable pods
I0323 19:15:48.151241       1 static_autoscaler.go:419] No unschedulable pods

Actual results:

The autoscaler stops working altogether when it encounters an error for one MachineSet.

Expected results:

1. Hitting an error for one MachineAutoscaler is acceptable, but the autoscaler is expected to keep working for the other MachineAutoscalers that are still healthy.

2. The error message is also not helpful. During further testing after the error occurred, the autoscaler was not being triggered, and the message does not say that the error must be fixed before the autoscaler can run again.

is duplicated by

OCPBUGS-37263 Investigate Cluster Autoscaler: Failed to fix node group sizes: failed to decrease

Closed

is related to

OCPBUGS-31760 Cluster Autoscaler Not Scaling

ASSIGNED

OCPSTRAT-1106 Enable priority and least-waste expanders for cluster autoscaler

Closed

relates to

OCPBUGS-37263 Investigate Cluster Autoscaler: Failed to fix node group sizes: failed to decrease

Closed

links to

openshift/kubernetes-autoscaler#278: OCPBUGS-11115: improve replica counting on openshift

openshift/kubernetes-autoscaler#343: OCPBUGS-11115: make DecreaseTargetSize more accurate


Assignee: Michael McCune

Reporter: Hevellyn Gomes

QA Contact: Zhaohua Sun

Votes: 3

Watchers: 28

Created: 2023/03/30 9:10 AM

Updated: 2025/03/06 3:12 PM
