OCPBUGS-73857

Etcd Operator should attempt to scale down replicas in the same AZ first

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 4.18.z, 4.19.z, 4.20.z, 4.22.0, 4.21.z
    • Etcd
    • None
    • None
    • False
    • Rejected

      Description of problem:

      When control plane machines are going away, the etcd operator currently appears to pick the member to remove at random (I haven't found the code). This can lead to a scenario where the cluster ends up with two control plane nodes instead of three.
      
      To reproduce:
      * Create a cluster (I used AWS)
      * Ensure there is a ControlPlaneMachineSet, that its spec.state is Active, and that the strategy is RollingUpdate (this is the default)
      * Use oc to delete ALL control plane Machines simultaneously
      * CPMS will create a new machine for index 0 only
      * Etcd operator will see the new instance, add etcd, join it as a learner, and eventually swap it with one of the existing members
        * This swap appears to be random; it would be preferable to swap onto the same AZ/index where there is duplication
      * Etcd removes the lifecycle hook, MAPI removes the instance
      * If the etcd member chosen for removal was not at index 0, CPMS still does not bring up a new node, so the indexes are now [0, 0, 2]
      * Etcd operator then removes the lifecycle hook from the original index 0 machine
      * The cluster goes down to 2 control plane machines
      * The cluster eventually recovers
      
      I would not expect the operator to allow the cluster to go down to 2 control plane machines. It should also pick index 0 to replace, since that index is duplicated.
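
      The expected behaviour can be sketched as a small selection function. This is a minimal illustration under stated assumptions, not the operator's actual code: the names Member, pick_member_to_remove, and min_voters are hypothetical, and the policy (refuse to go below three voters, prefer a member whose index is duplicated) is only one reading of the expectation in this report.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Member:
    """Hypothetical view of a voting etcd member and its backing Machine."""
    name: str
    index: int       # CPMS index, which maps to an AZ / failure domain
    deleting: bool   # True if the backing Machine is pending deletion


def pick_member_to_remove(members: list[Member],
                          min_voters: int = 3) -> Optional[Member]:
    """Return the member to scale down, or None if removal is unsafe.

    Assumed policy (not confirmed operator behaviour):
      1. Never remove a member if that would leave fewer than min_voters.
      2. Only consider members whose Machine is pending deletion.
      3. Prefer a member whose index is also held by another member.
    """
    if len(members) <= min_voters:
        return None
    per_index = Counter(m.index for m in members)
    candidates = [m for m in members if m.deleting]
    if not candidates:
        return None
    # Duplicated indexes sort first (False < True); name breaks ties.
    candidates.sort(key=lambda m: (per_index[m.index] < 2, m.name))
    return candidates[0]


# Scenario from this report: all three old machines are being deleted and
# CPMS has created a replacement for index 0 only.
members = [
    Member("old-0", 0, deleting=True),
    Member("old-1", 1, deleting=True),
    Member("old-2", 2, deleting=True),
    Member("new-0", 0, deleting=False),
]
choice = pick_member_to_remove(members)
```

      With this policy the duplicated index 0 member is removed first, which leaves indexes [0, 1, 2]. A second call on the remaining three members returns None, so the cluster is not taken down to two control plane machines.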

      Additional info:

          This came from a customer case, and I managed to reproduce it on 4.18.31.

              joelspeed Joel Speed
              joelspeed Joel Speed
              None
              None
              Ge Liu Ge Liu