OpenShift Bugs / OCPBUGS-52156

Autoscaling doesn't scale enough nodes to run all jobs requested


    • Quality / Stability / Reliability
    • False
    • Important
    • AUTOSCALE - Sprint 270
    • 1

      Description of problem:

      Job Execution: 250 jobs were launched in the optimiser-prd namespace; 248 ran successfully, but 2 remain in a Pending state.

      Autoscaling: The Cluster Autoscaler created 32 new nodes to handle the workload, and all nodes appear in a Ready state.

      Detected Issue: Despite the new nodes, two pods were not scheduled, indicating a possible restriction related to resources, affinity, or another cluster condition preventing their assignment.

      Our colleague marpears@redhat.com proposed the following:

      1. Node Taints Prevent Scheduling:

      • The node ip-10-163-150-170.eu-west-1.compute.internal has taints that may have been left over from a previous scale-down of the machine pool.
      • This node is the only one that meets the pending pod's nodeSelector requirements and has enough allocatable memory.
      • However, pods cannot be scheduled on it because of the ToBeDeletedByClusterAutoscaler taint.
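The reasoning in point 1 can be illustrated with a minimal sketch of taint and toleration matching. This is not the scheduler's code, only a simplified model of the documented matching rules. The node taints are the ones reported in this issue; the pod tolerations are hypothetical, since the pending pods' specs are not included here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: Optional[str] = None      # empty key with operator Exists matches any key
    operator: str = "Equal"
    value: str = ""
    effect: Optional[str] = None   # empty effect matches any effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Simplified version of the Kubernetes toleration matching rules."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if not tol.key:
        return tol.operator == "Exists"
    if tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    return tol.operator == "Equal" and tol.value == taint.value


def blocking_taints(taints: List[Taint], tolerations: List[Toleration]) -> List[Taint]:
    """Return the hard taints (NoSchedule/NoExecute) that no toleration matches.

    PreferNoSchedule is a soft preference and never blocks scheduling.
    """
    return [
        t for t in taints
        if t.effect in ("NoSchedule", "NoExecute")
        and not any(tolerates(tol, t) for tol in tolerations)
    ]


# Taints reported on ip-10-163-150-170.eu-west-1.compute.internal in this issue.
node_taints = [
    Taint("mp", "NoSchedule", "mp01"),
    Taint("type", "NoSchedule", "compute"),
    Taint("type", "NoExecute", "compute"),
    Taint("DeletionCandidateOfClusterAutoscaler", "PreferNoSchedule", "1740615052"),
    Taint("ToBeDeletedByClusterAutoscaler", "NoSchedule", "1740615662"),
]

# HYPOTHETICAL tolerations: assumes the job pods tolerate the machine pool taints.
pod_tolerations = [
    Toleration(key="mp", operator="Equal", value="mp01", effect="NoSchedule"),
    Toleration(key="type", operator="Equal", value="compute"),
]

blocked_by = blocking_taints(node_taints, pod_tolerations)
print([t.key for t in blocked_by])
```

Under these assumed tolerations, the only taint left blocking the pod is ToBeDeletedByClusterAutoscaler, which is consistent with the assessment above.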

      2. Possible Fix in OpenShift 4.18:

      3. Bug Reference for Another Related Taint:

      • The DeletionCandidateOfClusterAutoscaler taint seems related to an OpenShift bug: https://issues.redhat.com/browse/OCPBUGS-42132
      • spec:
          providerID: aws:///eu-west-1b/i-0b625b74bb78f2410
          taints:
          - effect: NoSchedule
            key: mp
            value: mp01
          - effect: NoSchedule
            key: type
            value: compute
          - effect: NoExecute
            key: type
            value: compute
          - effect: PreferNoSchedule
            key: DeletionCandidateOfClusterAutoscaler
            value: "1740615052"
          - effect: NoSchedule
            key: ToBeDeletedByClusterAutoscaler
            value: "1740615662"
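The values of the two autoscaler taints appear to be Unix epoch timestamps in seconds, which is how the upstream Cluster Autoscaler is understood to stamp these taints when it applies them. Assuming that interpretation holds, the sketch below decodes them so they can be compared with the time the jobs were launched. The helper names are illustrative only.

```python
from datetime import datetime, timezone

# Taint values copied from the node spec in this issue.
AUTOSCALER_TAINTS = {
    "DeletionCandidateOfClusterAutoscaler": "1740615052",
    "ToBeDeletedByClusterAutoscaler": "1740615662",
}


def taint_time(value: str) -> datetime:
    """Interpret a taint value as Unix epoch seconds (an assumption) in UTC."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


def taint_age_seconds(value: str, now: datetime) -> float:
    """Seconds elapsed between the taint timestamp and a timezone-aware 'now'."""
    return (now - taint_time(value)).total_seconds()


for key, value in AUTOSCALER_TAINTS.items():
    print(f"{key}: {taint_time(value).isoformat()}")
```

If the decoded times are well before the job launch, that would support the theory that the taints were left over from an earlier scale-down rather than applied during the current scale-up.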
        
        

         

      4. Request to Check with Engineering and Consider Backporting:

      • We request your help verifying whether engineering agrees with his assessment of the issue.
      • Since the customer (CX) is still running OCP 4.16 and is far from upgrading to 4.18, he asks whether engineering could backport the fix to 4.16.
      • This would help resolve the issue sooner rather than waiting for an upgrade.

       

              mimccune@redhat.com Michael McCune
              rhn-support-harizape Hernando Ariza Perez
              Zhaohua Sun