OpenShift Bugs / OCPBUGS-60790

Cluster AutoScaler + CAPI Provider incorrectly interferes with MachineDeployment Replicas


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • 4.18
    • Cluster Autoscaler
    • Quality / Stability / Reliability
    • AUTOSCALE - Sprint 276, AUTOSCALE - Sprint 277, AUTOSCALE - Sprint 278

      Description of problem:

      • The Cluster Autoscaler CAPI provider miscalculates its safety constraints during MachineDeployment rolling updates, which leads to massive over-deletion of machines.

      Version-Release number of selected component (if applicable):

          4.18 (we believe this affects all versions)

      How reproducible:

        Often  

      Steps to Reproduce:

      1. Create a MachineDeployment with minSize=1 and maxSize=n; assume the current spec.replicas=20.
      2. Set maxUnavailable=0 for the rolling update.
      3. Trigger a rolling update (e.g., change the machine template).
      4. Use a PDB or K8s-Shredder to prevent the old machines in the old MachineSet from being deleted.
      5. For example, 20 new machines are created, for a current total of 40 machines.
      6. The workloads migrate from the old machines to the new-version machines, leaving the old machines empty.
      7. Observe that the Cluster Autoscaler tries to remove 19 empty nodes that are stuck in the rolling upgrade.
      8. Result: the MachineDeployment scales down to 1 machine, deleting 39 machines in total.

      Actual results:

      1. The Cluster Autoscaler calls SetSize(20-19) → reduces spec.replicas to 1.
      2. The MachineDeployment controller sees spec.replicas=1 versus 40 actual machines.
      3. Because the old machines in the old MachineSet are not being deleted, all new machines in the new MachineSet are deleted except 1.
      4. The MachineDeployment controller actually deletes 39 machines (not just the 19 empty ones).
      5. The MachineDeployment scales from 40 machines to 1 machine.
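
      The arithmetic behind these results can be sketched with a toy model. This is not the autoscaler's code: `MachineDeploymentState` and `naive_scale_down` are hypothetical names, and the assumption is that the safety check compares the number of empty nodes against spec.replicas and minSize only, ignoring the extra machines that exist during the rollout.

```python
from dataclasses import dataclass


@dataclass
class MachineDeploymentState:
    """Toy model of the scenario in this report; not a real CAPI type."""
    spec_replicas: int   # desired replicas on the MachineDeployment
    old_machines: int    # machines still owned by the old MachineSet
    new_machines: int    # machines created by the rolling update
    min_size: int = 1    # autoscaler lower bound (minSize)

    @property
    def actual_machines(self) -> int:
        return self.old_machines + self.new_machines


def naive_scale_down(state: MachineDeploymentState, empty_nodes: int) -> int:
    """Return the new spec.replicas when the check looks at spec.replicas only.

    The removal count is capped so spec.replicas stays at or above min_size,
    but the cap ignores machines that exist beyond spec.replicas mid-rollout.
    """
    removable = min(empty_nodes, state.spec_replicas - state.min_size)
    return state.spec_replicas - removable


# Numbers from the report: 20 desired replicas, 20 old + 20 new machines,
# and all 20 old nodes empty after the workloads migrated.
state = MachineDeploymentState(spec_replicas=20, old_machines=20, new_machines=20)
new_replicas = naive_scale_down(state, empty_nodes=20)

# Machines the MachineDeployment controller must delete to reach new_replicas.
surplus = state.actual_machines - new_replicas

print(new_replicas, surplus)
```

      With the report's numbers, the cap allows 19 removals, so the sketch yields spec.replicas=1 and a surplus of 39 machines for the controller to delete.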

      Expected results:

      Safety calculations should consider the actual running machine count during rolling updates.

      If CAS finds that the running empty nodes belong to the old machines, it should consider them pending deletion and take no scale-down action that reduces the MachineDeployment replicas.
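
      One possible shape for such a guard is sketched below, purely as an illustration of the expected behavior. The `Node` type, its `machine_set` field, and both helper functions are hypothetical; they do not correspond to the autoscaler's actual API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Node:
    name: str
    machine_set: str  # owning MachineSet name (hypothetical field)
    empty: bool       # True when no workloads remain on the node


def scale_down_candidates(nodes: List[Node], current_machine_set: str) -> List[Node]:
    """Empty nodes that may justify lowering spec.replicas.

    Empty nodes owned by an old MachineSet are treated as pending deletion
    by the rolling update, so they are skipped here.
    """
    return [n for n in nodes if n.empty and n.machine_set == current_machine_set]


def safe_replicas(spec_replicas: int, min_size: int,
                  nodes: List[Node], current_machine_set: str) -> int:
    """Reduce spec.replicas only by candidates on the current MachineSet."""
    removable = len(scale_down_candidates(nodes, current_machine_set))
    return max(min_size, spec_replicas - removable)


# Scenario from the report: 20 empty old nodes, 20 busy new nodes.
nodes = [Node(f"old-{i}", "ms-old", empty=True) for i in range(20)]
nodes += [Node(f"new-{i}", "ms-new", empty=False) for i in range(20)]

candidates = scale_down_candidates(nodes, current_machine_set="ms-new")
result = safe_replicas(20, 1, nodes, current_machine_set="ms-new")

print(len(candidates), result)
```

      In this sketch the empty old nodes are not counted, so spec.replicas stays at 20 and the rollout is left to remove the old machines.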

      Additional info:

          

              Michael McCune (mimccune@redhat.com)
              Jude Zhu (rhn-support-judzhu)
              Paul Rozehnal