Type: Bug
Resolution: Unresolved
Severity: Normal
Affects Version: 4.17.z
Component area: Quality / Stability / Reliability
Priority: Important
Description of problem:
The cluster autoscaler does not honour the PDB configuration when deciding which pods can be evicted from a node. Specifically, it ignores "unhealthyPodEvictionPolicy: AlwaysAllow". With this policy it should always be possible to evict unhealthy pods, yet the cluster autoscaler still refuses to scale down the node and logs the following messages:
~~~
I1106 02:06:51.252082 1 klogx.go:87] Node ip-10-112-148-236.ap-southeast-2.compute.internal - cpu requested is 88.1567% of allocatable
I1106 02:06:51.263127 1 cluster.go:156] Simulating node ip-10-112-148-236.ap-southeast-2.compute.internal removal
I1106 02:06:51.263945 1 cluster.go:160] node ip-10-112-148-236.ap-southeast-2.compute.internal cannot be removed: not enough pod disruption budget to move namespace name/xxx-xxx-64c7f5b688-qz6zp
~~~
The affected pods are unhealthy (stuck in ContainerCreating):
~~~
$ oc get pods -n a-xxxxx -o wide
NAME                              READY   STATUS              RESTARTS   AGE   IP       NODE                                                NOMINATED NODE   READINESS GATES
aaaa-ccccc-api-64c7f5b688-qz6zp   0/1     ContainerCreating   0          15d   <none>   ip-10-112-181-263.ap-southeast-2.compute.internal   <none>           <none>
ddddd-eeee-api-7d7cf55cdd-bsr6k   0/1     ContainerCreating   0          15d   <none>   ip-10-112-186-69.ap-southeast-2.compute.internal    <none>           <none>
~~~
The PDB covering the blocking pod:
~~~
$ oc get pdb <pdb name> -o yaml -n a-xxxxxx
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  creationTimestamp: "2024-12-12T10:03:25Z"
  generation: 2
  name: PDB name
  namespace: Namespace name
  resourceVersion: "3512536993"
  uid: b453b0a4-91d6-46df-93dd-9028595a9f77
spec:
  maxUnavailable: 1
  selector:
    matchExpressions:
    - key: batch.kubernetes.io/job-name
      operator: DoesNotExist
    matchLabels:
      app: dev-testing-123
  unhealthyPodEvictionPolicy: AlwaysAllow
status:
  conditions:
  - lastTransitionTime: "2024-12-13T09:07:33Z"
    message: ""
    observedGeneration: 2
    reason: InsufficientPods
    status: "False"
    type: DisruptionAllowed
  currentHealthy: 0
  desiredHealthy: 0
  disruptionsAllowed: 0
  expectedPods: 1
  observedGeneration: 2
~~~
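For clarity, the crux of the report is the combination of these two fields excerpted from the PDB above: the spec asks for unhealthy pods to always be evictable, while the status reports no disruptions allowed, and the autoscaler appears to act only on the latter:
~~~
spec:
  unhealthyPodEvictionPolicy: AlwaysAllow
status:
  disruptionsAllowed: 0
~~~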
Slack thread – https://redhat-internal.slack.com/archives/C02F1J9UJJD/p1762404970980579
Version-Release number of selected component (if applicable):
ROSA 4.17.42
How reproducible:
Steps to Reproduce:
1. Ensure that pods are stuck in the ImagePullBackOff or ContainerCreating state.
2. Create a PDB that selects those pods and specify "unhealthyPodEvictionPolicy: AlwaysAllow" (see the example manifests after this list).
3. Ensure that CPU and memory request values on the node are less than 50% of allocatable and that nothing else prevents the cluster autoscaler from scaling down the node; the node should be eligible for scale down.
4. Check cluster-autoscaler-default pod logs for the following message: I1106 02:06:51.263945 1 cluster.go:160] node ip-10-112-148-236.ap-southeast-2.compute.internal cannot be removed: not enough pod disruption budget to move Namespace name/aaa--bbbbb-64c7f5b688-qz6zp
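A minimal sketch of manifests for steps 1 and 2. The namespace, resource names, and image are purely illustrative (the label app: dev-testing-123 matches the PDB shown above); a non-existent image keeps the pods unhealthy:
~~~
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pdb-repro            # hypothetical name, for illustration only
  namespace: pdb-repro-test  # hypothetical namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dev-testing-123
  template:
    metadata:
      labels:
        app: dev-testing-123
    spec:
      containers:
      - name: app
        # non-existent image keeps the pod stuck in ImagePullBackOff / ContainerCreating
        image: registry.example.com/does-not-exist:latest
        resources:
          requests:
            cpu: 10m
            memory: 32Mi
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: pdb-repro            # hypothetical name
  namespace: pdb-repro-test
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: dev-testing-123
  unhealthyPodEvictionPolicy: AlwaysAllow
~~~
With the pod unhealthy, the PDB status reports disruptionsAllowed: 0 (as in the status shown in the description), and the autoscaler log message from step 4 should appear for the node hosting the pod.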
Actual results:
The cluster autoscaler does not scale down the node because the PDB reports disruptionsAllowed: 0, even though unhealthyPodEvictionPolicy is set to AlwaysAllow and the blocking pod is unhealthy.
Expected results:
The cluster autoscaler should treat the unhealthy pod as evictable (per unhealthyPodEvictionPolicy: AlwaysAllow) and scale down the node successfully.
Additional info: