Bug
Resolution: Unresolved
Normal
None
4.15.z
No
False
Description of problem:
The autoscaler is not scaling up, even though multiple pods are attempting to schedule and cannot.
Version-Release number of selected component (if applicable):
This specific instance was on 4.15.3
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
There is one node in the `build` machineset, which is configured to autoscale from 0 to 120 nodes. That node is using 80% of its CPU and new workloads cannot be scheduled on it. However, the autoscaler reports that there is no need to scale, even though there are pending pods.

Log messages from the failed CI pipelines that seemingly contradict each other:

* 2024-04-04T10:44:56Z 7x default-scheduler: 0/35 nodes are available: 1 Insufficient cpu, 16 node(s) had untolerated taint {node-role.kubernetes.io/tests: tests-worker}, 2 node(s) had untolerated taint {node-role.kubernetes.io/infra: }, 3 node(s) had untolerated taint {node-role.kubernetes.io/longtests-worker: longtests-worker}, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }, 4 node(s) didn't match Pod's node affinity/selector, 6 node(s) had untolerated taint {node-role.kubernetes.io/ci-prowjobs-worker: ci-prowjobs-worker}. preemption: 0/35 nodes are available: 1 No preemption victims found for incoming pod, 34 Preemption is not helpful for scheduling.
* 2024-04-04T10:59:59Z 6x cluster-autoscaler: pod didn't trigger scale-up: 1 node(s) had untolerated taint {node-role.kubernetes.io/longtests-worker: longtests-worker}, 1 node(s) had untolerated taint {node-role.kubernetes.io/prowjobs-worker: prowjobs-worker}, 1 node(s) had untolerated taint {node-role.kubernetes.io/tests-worker: tests-worker}, 1 node(s) didn't match Pod's node affinity/selector, 1 not ready for scale-up

What specifically stands out to me:

* default-scheduler: 0/35 nodes are available: 1 Insufficient cpu
* cluster-autoscaler: pod didn't trigger scale-up: 1 not ready for scale-up
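One angle worth checking in the must-gather: the cluster-autoscaler decides per node group whether a pending pod could land there, comparing the pod's tolerations against the taints declared on the MachineSet and, for groups at or near zero, using the capacity hints on the MachineSet itself. A minimal sketch of where those fields live on an OpenShift MachineSet is below; the machineset name `build` comes from this report, while the taint key and all annotation values are illustrative placeholders, not values taken from the affected cluster:

```yaml
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: build
  namespace: openshift-machine-api
  annotations:
    # Scale-from-zero capacity hints the autoscaler reads when no node
    # from this group exists yet (values here are illustrative only)
    machine.openshift.io/vCPU: "16"
    machine.openshift.io/memoryMb: "65536"
spec:
  template:
    spec:
      # Taints declared here must be tolerated by the pending pod,
      # otherwise the autoscaler reports "untolerated taint" and
      # skips this group when simulating a scale-up
      taints:
        - key: node-role.kubernetes.io/build
          value: build-worker
          effect: NoSchedule
```

If the taints on the template (or the capacity annotations) do not match what the running node actually carries, the autoscaler's simulation can disagree with the scheduler, which would be consistent with the contradictory log lines above.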
Expected results:
Because CPU on the existing node is insufficient to schedule the additional workloads, the autoscaler should scale up the machineset.
Additional info:
As a workaround, I had the team set a higher minimum node count after taking a must-gather.
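The workaround amounts to raising the floor on the MachineAutoscaler so nodes always exist and the broken scale-up decision is never needed. A minimal sketch, assuming the MachineAutoscaler targets the `build` machineset; the resource name and the minimum of 3 are illustrative, not the actual values used:

```yaml
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: build
  namespace: openshift-machine-api
spec:
  # Raised from 0 as a workaround so capacity is always present,
  # sidestepping the failing scale-up decision (the "3" is illustrative)
  minReplicas: 3
  maxReplicas: 120
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: build
```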
Relates to:
- OCPBUGS-11115: Autoscaler does not work after entering in failed status for a single machineautoscaler (ASSIGNED)