Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-2031

parallel scale up and scale down can cause undesired state for MachineSet when provisioning is failing

XMLWordPrintable

    • Important
    • None
    • CLOUD Sprint 227
    • 1
    • Rejected
    • False
    • Hide

      None

      Show
      None

      Description of problem:

      Having OpenShift Container Platform 4 installed on AWS or Azure (does not matter) and MachineSet configured with "deletePolicy: Oldest" can lead to undesired state of MachineSet when parallel scaling (up and down) is happening but provisioning.
      
      $ oc get machine
      NAME                                           PHASE          TYPE         REGION      ZONE         AGE
      fooba03327360-5fzq5-master-0                  Running        m6i.xlarge   us-east-2   us-east-2a   40m
      fooba03327360-5fzq5-master-1                  Running        m6i.xlarge   us-east-2   us-east-2b   40m
      fooba03327360-5fzq5-master-2                  Running        m6i.xlarge   us-east-2   us-east-2c   40m
      fooba03327360-5fzq5-worker-us-east-2a-58k7v   Running        m6i.large    us-east-2   us-east-2a   4m40s
      fooba03327360-5fzq5-worker-us-east-2a-jgpl2   Provisioning   m6i.large    us-east-2   us-east-2a   12s
      fooba03327360-5fzq5-worker-us-east-2a-knhlb   Running        m6i.large    us-east-2   us-east-2a   38m
      fooba03327360-5fzq5-worker-us-east-2b-tlhnd   Running        m6i.large    us-east-2   us-east-2b   37m
      fooba03327360-5fzq5-worker-us-east-2c-7hp8x   Running        m6i.large    us-east-2   us-east-2c   4m36s
      fooba03327360-5fzq5-worker-us-east-2c-g5bps   Provisioning   m6i.large    us-east-2   us-east-2c   12s
      fooba03327360-5fzq5-worker-us-east-2c-jw66f   Deleting       m6i.large    us-east-2   us-east-2c   38m
      
      $ oc get machineset
      NAME                                     DESIRED   CURRENT   READY   AVAILABLE   AGE
      fooba03327360-5fzq5-worker-us-east-2a   3         3         2       2           40m
      fooba03327360-5fzq5-worker-us-east-2b   1         1         1       1           40m
      fooba03327360-5fzq5-worker-us-east-2c   2         2         1       1           40m
      
      In the above example, we can see that MachineSet called fooba03327360-5fzq5-worker-us-east-2c is being scaled-up and in parallel a Machine is removed. This already leads to an undesired state as suddenly we have only 1 available OpenShift Container Platform 4 - Node instead of the desired 2.
      
      Just consider the provisioning of the new machine would fail because the instance type is currently not available for the MachineSet that is trying to scale or because of another problem. With that we would be stuck with only 1 OpenShift Container Platform 4 - Node instead of the desired 2.
      
      So based on the above, the expectation would be that when in parallel up and down scaling is happening and first OpenShift Container Platform 4 - Node(s) in provisioning/provisioned state are going to be removed before any other OpenShift Container Platform 4 - Node is being removed. That way, everything running will continue run and once the OpenShift Container Platform 4 - Node(s) in provisioning/provisioned state are removed, additional action can be taken if further down scaling is required. But it would ensure that capacity does remain available and also that we don't remove OpenShift Container Platform 4 - Node(s) before new OpenShift Container Platform 4 - Node(s) are available and their capacity can be used.
      

      Version-Release number of selected component (if applicable):

      OpenShift Container Platform 4.10.28 (but we suspect this applies to all OpenShift Container Platform 4 version currently available)

      How reproducible:

      Always

      Steps to Reproduce:

      1. Install OpenShift Container Platform 4 on AWS or Azure for example
      2. Set "deletePolicy: Oldest" in the MachineSet
      3. Scale the MachineSet up and while the Machine is provisioning Scale down
      4 . The MachineSet will only have 1 Node available instead of 2

      Actual results:

      Only one Machine available in the MachineSet instead of 2 desired because a properly working Machine is removed while provisioning of the new Machine is taking time or eventually will fail at some point in time.

      Expected results:

      Always try to remove Machines that are in provisioning/provisioned state that are not yet ready before removing perfectly working Machines as otherwise one might experience capacity issues

      Additional info:

       

        1. Screenshot 2022-11-08 at 16.20.21.png
          569 kB
          Damiano Donati
        2. Screenshot 2022-11-09 at 11.38.16.png
          381 kB
          Damiano Donati
        3. Screenshot 2022-11-09 at 11.38.18.png
          420 kB
          Damiano Donati

              ddonati@redhat.com Damiano Donati
              rhn-support-sreber Simon Reber
              Milind Yadav Milind Yadav
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Created:
                Updated:
                Resolved: