OCPBUGS-31760: Cluster Autoscaler Not Scaling


      Description of problem:

      The autoscaler is not scaling up, even though there are multiple pods that are trying to schedule and cannot.
          

      Version-Release number of selected component (if applicable):

      This specific instance was on 4.15.3
          

      How reproducible:

      
          

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

      The `build` machineset is set to autoscale from 0 to 120 nodes and currently has a single node. That node is using 80% of its CPU, and new workloads cannot be scheduled on it. However, the autoscaler reports that there is no need to scale up, even though there are pending pods.
      
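      For context, the 0-120 range would normally come from a MachineAutoscaler targeting the `build` MachineSet in the openshift-machine-api namespace. A quick way to confirm the configured bounds and the current replica count; this is a sketch, and the resource name `build` is an assumption based on this report:

          # Configured autoscaling bounds (assumed MachineAutoscaler name)
          oc -n openshift-machine-api get machineautoscaler build \
            -o jsonpath='{.spec.minReplicas} {.spec.maxReplicas}{"\n"}'
          # Desired vs. ready replicas on the MachineSet itself
          oc -n openshift-machine-api get machineset build \
            -o jsonpath='{.spec.replicas} {.status.readyReplicas}{"\n"}'
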
      Specific log messages from the failed CI pipelines that seemingly contradict each other:
      
      * 2024-04-04T10:44:56Z 7x default-scheduler: 0/35 nodes are available: 1 Insufficient cpu, 16 node(s) had untolerated taint {node-role.kubernetes.io/tests: tests-worker}, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/longtests-worker: longtests-worker}, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 4 node(s) didn't match Pod's node affinity/selector, 6 node(s) had untolerated taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}. preemption: 0/35 nodes are available: 1 No preemption victims found for incoming pod, 34 Preemption is not helpful for scheduling..
      * 2024-04-04T10:59:59Z 6x cluster-autoscaler: pod didn't trigger scale-up: 1 node(s) had untolerated taint {node-role.kubernetes.io/longtests-worker: longtests-worker}, 1 node(s) had untolerated taint {node-role.kubernetes.io/prowjobs-worker: prowjobs-worker}, 1 node(s) had untolerated taint {node-role.kubernetes.io/tests-worker: tests-worker}, 1 node(s) didn't match Pod's node affinity/selector, 1 not ready for scale-up
      
      Specifically what stands out to me:
      
      default-scheduler: 0/35 nodes are available: 1 Insufficient cpu
      
      cluster-autoscaler: pod didn't trigger scale-up: 1 not ready for scale-up
          
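      To compare the two sides directly, the autoscaler's own log and the pending pod's scheduling events can be pulled from the cluster. A minimal sketch: cluster-autoscaler-default is the default deployment name on OpenShift, and <namespace>/<pending-pod> are placeholders.

          # Autoscaler log, including the "pod didn't trigger scale-up" reasons quoted above
          oc -n openshift-machine-api logs deployment/cluster-autoscaler-default --since=1h
          # Scheduler events for one of the pending pods
          oc -n <namespace> describe pod <pending-pod>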

      Expected results:

      Because CPU on the existing node is insufficient to schedule the additional workloads, the autoscaler should scale up the `build` machineset.
          

      Additional info:

      As a workaround, I had the team raise the minimum node count on the machineset's autoscaler after taking a must-gather.
          
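      For reference, that workaround amounts to raising minReplicas on the MachineAutoscaler. An illustrative command; the resource name and the value 3 are examples, not the numbers actually used:

          # Illustrative only: raise the minimum so capacity is provisioned up front
          oc -n openshift-machine-api patch machineautoscaler build \
            --type merge -p '{"spec":{"minReplicas":3}}'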
