OpenShift Bugs / OCPBUGS-64722

Cluster autoscaler does not honour the PDB configuration when considering pods able to be evicted from a node

      Description of problem:

      The cluster autoscaler does not honour the PDB configuration when deciding which pods can be evicted from a node during scale-down. In particular, it ignores "unhealthyPodEvictionPolicy: AlwaysAllow": under this policy unhealthy pods can always be evicted, yet the autoscaler still refuses to scale down the node hosting them.
      
      The following messages appear in the cluster autoscaler logs:
      
      ~~~
      I1106 02:06:51.252082       1 klogx.go:87] Node ip-10-112-148-236.ap-southeast-2.compute.internal - cpu requested is 88.1567% of allocatable
      
      I1106 02:06:51.263127       1 cluster.go:156] Simulating node ip-10-112-148-236.ap-southeast-2.compute.internal removal
      
      I1106 02:06:51.263945       1 cluster.go:160] node ip-10-112-148-236.ap-southeast-2.compute.internal cannot be removed: not enough pod disruption budget to move namespace name/xxx-xxx-64c7f5b688-qz6zp
      ~~~
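      
      For reference, these messages can be pulled straight from the autoscaler pod. A minimal sketch, assuming the default ClusterAutoscaler deployment (cluster-autoscaler-default, referenced in the steps below) runs in the openshift-machine-api namespace:
      
      ~~~
      $ oc logs -n openshift-machine-api deployment/cluster-autoscaler-default \
          | grep -E 'Simulating node|cannot be removed'
      ~~~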
      
      ~~~
      $ oc get pods -n a-xxxxx -o wide
      NAME                              READY   STATUS              RESTARTS   AGE   IP       NODE                                                NOMINATED NODE   READINESS GATES
      aaaa-ccccc-api-64c7f5b688-qz6zp   0/1     ContainerCreating   0          15d   <none>   ip-10-112-181-263.ap-southeast-2.compute.internal   <none>           <none>
      ddddd-eeee-api-7d7cf55cdd-bsr6k   0/1     ContainerCreating   0          15d   <none>   ip-10-112-186-69.ap-southeast-2.compute.internal    <none>           <none>
      ~~~
      
      
      ~~~
      $ oc get pdb <pdb name> -o yaml -n a-xxxxxx
      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        creationTimestamp: "2024-12-12T10:03:25Z"
        generation: 2
        name: PDB name
        namespace: Namespace name
        resourceVersion: "3512536993"
        uid: b453b0a4-91d6-46df-93dd-9028595a9f77
      spec:
        maxUnavailable: 1
        selector:
          matchExpressions:
          - key: batch.kubernetes.io/job-name
            operator: DoesNotExist
          matchLabels:
            app: dev-testing-123
        unhealthyPodEvictionPolicy: AlwaysAllow
      status:
        conditions:
        - lastTransitionTime: "2024-12-13T09:07:33Z"
          message: ""
          observedGeneration: 2
          reason: InsufficientPods
          status: "False"
          type: DisruptionAllowed
        currentHealthy: 0
        desiredHealthy: 0
        disruptionsAllowed: 0
        expectedPods: 1
        observedGeneration: 2
      ~~~
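      
      As a cross-check that the eviction API itself honours AlwaysAllow (which would isolate the problem to the autoscaler's scale-down simulation rather than the API server), an eviction can be posted directly against the stuck pod. A minimal sketch, assuming `oc proxy` on 127.0.0.1:8001 and the pod and namespace names from the output above:
      
      ~~~
      $ oc proxy --port=8001 &
      $ curl -s -X POST -H 'Content-Type: application/json' \
          http://127.0.0.1:8001/api/v1/namespaces/a-xxxxx/pods/aaaa-ccccc-api-64c7f5b688-qz6zp/eviction \
          -d '{"apiVersion":"policy/v1","kind":"Eviction","metadata":{"name":"aaaa-ccccc-api-64c7f5b688-qz6zp","namespace":"a-xxxxx"}}'
      ~~~
      
      With unhealthyPodEvictionPolicy: AlwaysAllow, the API server is expected to admit this eviction even though status.disruptionsAllowed is 0, because the pod is not Ready.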

      Slack thread – https://redhat-internal.slack.com/archives/C02F1J9UJJD/p1762404970980579

      Version-Release number of selected component (if applicable):

          ROSA 4.17.42

      How reproducible:

          

      Steps to Reproduce:

      1. Ensure that pods are stuck in ImagePullBackOff or ContainerCreating, i.e. they never become Ready (see the sketch after these steps).
      
      2. Create a PDB selecting those pods and specify "unhealthyPodEvictionPolicy: AlwaysAllow".
      
      3. Ensure that the node's CPU and memory requests are below 50% of allocatable and that nothing else prevents the cluster autoscaler from scaling down, so the node is eligible for scale-down.
      
      4. Check the cluster-autoscaler-default pod logs for the following message: I1106 02:06:51.263945       1 cluster.go:160] node ip-10-112-148-236.ap-southeast-2.compute.internal cannot be removed: not enough pod disruption budget to move Namespace name/aaa--bbbbb-64c7f5b688-qz6zp
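      
      A minimal sketch of steps 1 and 2; the project, resource names, labels, and image reference below are hypothetical and only need to match each other (the image deliberately does not exist, to force ImagePullBackOff):
      
      ~~~
      $ oc new-project pdb-repro-test
      $ oc apply -f - <<'EOF'
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: pdb-repro
        namespace: pdb-repro-test
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: pdb-repro
        template:
          metadata:
            labels:
              app: pdb-repro
          spec:
            containers:
            - name: app
              image: quay.io/nonexistent/image:does-not-exist  # hypothetical tag; pod never becomes Ready
      ---
      apiVersion: policy/v1
      kind: PodDisruptionBudget
      metadata:
        name: pdb-repro
        namespace: pdb-repro-test
      spec:
        maxUnavailable: 1
        selector:
          matchLabels:
            app: pdb-repro
        unhealthyPodEvictionPolicy: AlwaysAllow
      EOF
      ~~~
      
      Once the pod is stuck and the node's utilisation is below the scale-down threshold, the autoscaler is expected to log the "not enough pod disruption budget" message from step 4.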
      
      
          

      Actual results:

    The cluster autoscaler refuses to scale down the node, logging "not enough pod disruption budget", even though the only blocking pod is unhealthy and the PDB's AlwaysAllow policy permits its eviction.

      Expected results:

    The cluster autoscaler should honour unhealthyPodEvictionPolicy: AlwaysAllow, treat the unhealthy pods as evictable, and scale the node down.

      Additional info:

          
