OpenShift Bugs / OCPBUGS-64722

Cluster autoscaler does not honour the PDB configuration when considering pods able to be evicted from a node

      Description of problem:

      The cluster autoscaler does not honour the PDB configuration when deciding which pods can be evicted from a node during scale-down. In particular, it ignores "unhealthyPodEvictionPolicy: AlwaysAllow": under this policy unhealthy pods can always be evicted, yet the autoscaler still refuses to scale down the node hosting them.
      
      The following messages appear in the cluster autoscaler logs:
      
      ~~~
      I1106 02:06:51.252082       1 klogx.go:87] Node ip-10-112-148-236.ap-southeast-2.compute.internal - cpu requested is 88.1567% of allocatable
      
      I1106 02:06:51.263127       1 cluster.go:156] Simulating node ip-10-112-148-236.ap-southeast-2.compute.internal removal
      
      I1106 02:06:51.263945       1 cluster.go:160] node ip-10-112-148-236.ap-southeast-2.compute.internal cannot be removed: not enough pod disruption budget to move namespace name/xxx-xxx-64c7f5b688-qz6zp
      ~~~
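      
      For reference, these messages can be pulled straight from the autoscaler pod. A minimal sketch, assuming the default ClusterAutoscaler deployment (cluster-autoscaler-default, referenced in the steps below) runs in the openshift-machine-api namespace:
      
      ~~~
      $ oc logs -n openshift-machine-api deployment/cluster-autoscaler-default \
          | grep -E 'Simulating node|cannot be removed'
      ~~~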
      
      ~~~
      $ oc get pods -n a-xxxxx -o wide
      NAME                              READY   STATUS              RESTARTS   AGE   IP       NODE                                                NOMINATED NODE   READINESS GATES
      aaaa-ccccc-api-64c7f5b688-qz6zp   0/1     ContainerCreating   0          15d   <none>   ip-10-112-181-263.ap-southeast-2.compute.internal   <none>           <none>
      ddddd-eeee-api-7d7cf55cdd-bsr6k   0/1     ContainerCreating   0          15d   <none>   ip-10-112-186-69.ap-southeast-2.compute.internal    <none>           <none>
      ~~~
      
      
      ~~~
      $ oc get pdb <pdb name> -o yaml -n a-xxxxxx
      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        creationTimestamp: "2024-12-12T10:03:25Z"
        generation: 2
        name: PDB name
        namespace: Namespace name
        resourceVersion: "3512536993"
        uid: b453b0a4-91d6-46df-93dd-9028595a9f77
      spec:
        maxUnavailable: 1
        selector:
          matchExpressions:
          - key: batch.kubernetes.io/job-name
            operator: DoesNotExist
          matchLabels:
            app: dev-testing-123
        unhealthyPodEvictionPolicy: AlwaysAllow
      status:
        conditions:
        - lastTransitionTime: "2024-12-13T09:07:33Z"
          message: ""
          observedGeneration: 2
          reason: InsufficientPods
          status: "False"
          type: DisruptionAllowed
        currentHealthy: 0
        desiredHealthy: 0
        disruptionsAllowed: 0
        expectedPods: 1
        observedGeneration: 2
      ~~~
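      
      As a cross-check that the eviction API itself honours AlwaysAllow (which would isolate the problem to the autoscaler's scale-down simulation rather than the API server), an eviction can be posted directly against the stuck pod. A minimal sketch, assuming `oc proxy` on 127.0.0.1:8001 and the pod and namespace names from the output above:
      
      ~~~
      $ oc proxy --port=8001 &
      $ curl -s -X POST -H 'Content-Type: application/json' \
          http://127.0.0.1:8001/api/v1/namespaces/a-xxxxx/pods/aaaa-ccccc-api-64c7f5b688-qz6zp/eviction \
          -d '{"apiVersion":"policy/v1","kind":"Eviction","metadata":{"name":"aaaa-ccccc-api-64c7f5b688-qz6zp","namespace":"a-xxxxx"}}'
      ~~~
      
      With unhealthyPodEvictionPolicy: AlwaysAllow, the API server is expected to admit this eviction even though status.disruptionsAllowed is 0, because the pod is not Ready.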

      Slack thread – https://redhat-internal.slack.com/archives/C02F1J9UJJD/p1762404970980579

      Version-Release number of selected component (if applicable):

          ROSA 4.17.42

      How reproducible:

          

      Steps to Reproduce:

      1. Ensure that pods are stuck in ImagePullBackOff or ContainerCreating, i.e. they never become Ready (see the sketch after these steps).
      
      2. Create a PDB selecting those pods and specify "unhealthyPodEvictionPolicy: AlwaysAllow".
      
      3. Ensure that the node's CPU and memory requests are below 50% of allocatable and that nothing else prevents the cluster autoscaler from scaling down, so the node is eligible for scale-down.
      
      4. Check the cluster-autoscaler-default pod logs for the following message: I1106 02:06:51.263945       1 cluster.go:160] node ip-10-112-148-236.ap-southeast-2.compute.internal cannot be removed: not enough pod disruption budget to move Namespace name/aaa--bbbbb-64c7f5b688-qz6zp
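      
      A minimal sketch of steps 1 and 2; the project, resource names, labels, and image reference below are hypothetical and only need to match each other (the image deliberately does not exist, to force ImagePullBackOff):
      
      ~~~
      $ oc new-project pdb-repro-test
      $ oc apply -f - <<'EOF'
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: pdb-repro
        namespace: pdb-repro-test
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: pdb-repro
        template:
          metadata:
            labels:
              app: pdb-repro
          spec:
            containers:
            - name: app
              image: quay.io/nonexistent/image:does-not-exist  # hypothetical tag; pod never becomes Ready
      ---
      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: pdb-repro
        namespace: pdb-repro-test
      spec:
        maxUnavailable: 1
        selector:
          matchLabels:
            app: pdb-repro
        unhealthyPodEvictionPolicy: AlwaysAllow
      EOF
      ~~~
      
      Once the pod is stuck and the node's utilisation is below the scale-down threshold, the autoscaler is expected to log the "not enough pod disruption budget" message from step 4.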
      
      
          

      Actual results:

    The cluster autoscaler refuses to scale down the node, logging "not enough pod disruption budget", even though the only blocking pod is unhealthy and the PDB's AlwaysAllow policy permits its eviction.

      Expected results:

    The cluster autoscaler should honour unhealthyPodEvictionPolicy: AlwaysAllow, treat the unhealthy pods as evictable, and scale the node down.

      Additional info:

          
