Description of problem:
Etcd operator currently appears to pick at random which member to remove when machines are going away (I haven't found the code). This can lead to a scenario where you end up with 2 control plane nodes instead of 3.

To reproduce:
* Create a cluster (I used AWS)
* Ensure there is a ControlPlaneMachineSet, that its spec.state is Active, and that the strategy is RollingUpdate (this is the default)
* Use oc to delete ALL control plane Machines simultaneously
* CPMS will create a new machine for index 0 only
* Etcd operator will see the new instance, add etcd, join it as a learner, and eventually swap it with one of the existing members
* This swap appears to be random; it would be preferable to swap onto the same AZ/index where there is duplication
* Etcd removes the lifecycle hook, MAPI removes the instance
* In the case that the index of the chosen etcd member was not 0, CPMS continues to not bring up a new node; we now have [0, 0, 2]
* Etcd operator then decided to remove the lifecycle hook from the original index 0 machine
* The cluster went down to 2 control plane machines
* The cluster eventually recovered

I would not expect it to allow the cluster to go down to 2 control plane machines. It should also be picking index 0 to replace, since that index is duplicated.
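To make the expected behaviour concrete, here is a minimal Go sketch of the selection I would expect. It is not the etcd operator's actual code (which I haven't found); the Machine type, its fields, and pickMemberToRemove are all hypothetical names used only for illustration. It assumes each control plane Machine can be mapped to a CPMS index and that we know whether it is being deleted.

```go
package main

import (
	"fmt"
	"sort"
)

// Machine is a hypothetical, simplified view of a control plane Machine.
// None of these fields are taken from the real operator code.
type Machine struct {
	Name     string
	Index    int  // CPMS index this machine belongs to (assumed known)
	Deleting bool // true if the Machine has been marked for deletion
}

// pickMemberToRemove chooses which deleting machine's etcd member to swap out.
// It prefers a deleting machine whose index already has a non-deleting
// replacement (a duplicated index), then falls back to the lowest index.
func pickMemberToRemove(machines []Machine) (string, bool) {
	// Record which indexes already have a replacement machine.
	replaced := map[int]bool{}
	for _, m := range machines {
		if !m.Deleting {
			replaced[m.Index] = true
		}
	}

	// Only machines that are going away are candidates for removal.
	var candidates []Machine
	for _, m := range machines {
		if m.Deleting {
			candidates = append(candidates, m)
		}
	}
	if len(candidates) == 0 {
		return "", false
	}

	// Duplicated indexes sort first; ties are broken by index for determinism.
	sort.SliceStable(candidates, func(i, j int) bool {
		ri, rj := replaced[candidates[i].Index], replaced[candidates[j].Index]
		if ri != rj {
			return ri
		}
		return candidates[i].Index < candidates[j].Index
	})
	return candidates[0].Name, true
}

func main() {
	// State after CPMS has created a replacement for index 0 only.
	machines := []Machine{
		{"master-0", 0, true},
		{"master-1", 1, true},
		{"master-2", 2, true},
		{"master-new-0", 0, false},
	}
	name, ok := pickMemberToRemove(machines)
	fmt.Println(name, ok)
}
```

With the state from the reproduction steps (all three original machines deleting, one new machine at index 0), this picks the original index 0 machine, so CPMS would be left with [0, 1, 2] rather than [0, 0, 2].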
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
This came from a customer case, and I managed to reproduce it on 4.18.31.
- relates to: OCPBUGS-66081 Etcd quorum lost during control plane node replacement (New)