Description of problem:
N.B. I have been testing this in various combinations; the observations below may not be entirely representative of what is happening.

The etcd operator's behaviour changed between 4.17 and 4.18 so that it now allows removal of machines pending deletion earlier in the process. I believe this causes the cluster to degrade during an OnDelete rollout when multiple control plane machines are replaced simultaneously. This is a regression we should aim to fix, as the results are pretty bad.

In 4.17, if multiple machines were being replaced (someone had run `oc delete` on all of the machines at once), the etcd operator would wait for all revisions to roll out and for the cluster to stabilise before removing any lifecycle hooks from the machines. We would then see a "big bang" where all three machines had their lifecycle hooks removed together, after which the Machine API removed the three machines. At no point during this process did we see any API errors.

In 4.18, the etcd operator appears to remove the pre-drain hook while the cluster is still rolling out revisions. The first machine has its pre-drain hook removed and starts to be drained. At this point its etcd member has already been removed from the cluster and is unhealthy, but I believe the revision rollouts still include it, which impacts etcd cluster health. From there the cluster deteriorates: the KAS installer pod gets stuck in Pending, not all etcd pods are created, failing KAS pods (tied to the etcd member that was removed from the cluster) block removal of the machines because of PDBs from KASO, and then the cluster's API starts timing out requests (quorum loss?).

If I go to the cloud provider and terminate the three old masters, the cluster does recover correctly, but this should not be necessary and the cluster should never have served those failing requests.
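For reference, a minimal diagnostic sketch for watching the race between hook removal and the revision rollout. It assumes the standard openshift-machine-api/openshift-etcd namespaces, the usual `master` role label on the control plane machines, and that `jq` is available; nothing here is specific to this bug.

```sh
# List control plane machines with their deletion timestamps and remaining
# pre-drain lifecycle hooks (the etcd operator's hook should stay in place
# until the member has been safely removed).
oc get machines.machine.openshift.io -n openshift-machine-api \
  -l machine.openshift.io/cluster-api-machine-role=master -o json \
  | jq -r '.items[] | [.metadata.name, (.metadata.deletionTimestamp // "-"),
      ((.spec.lifecycleHooks.preDrain // []) | map(.name) | join(","))] | @tsv'

# Watch the static pod revision rollout driven by the etcd operator.
oc get etcd cluster -o json \
  | jq '.status.nodeStatuses[] | {nodeName, currentRevision, targetRevision}'

# Check etcd member health from inside one of the surviving etcd pods.
oc rsh -n openshift-etcd -c etcdctl \
  "$(oc get pods -n openshift-etcd -l app=etcd -o name | head -n 1)" \
  etcdctl endpoint health --cluster
```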
Version-Release number of selected component (if applicable):
How reproducible:
100%
Steps to Reproduce:
1. Start a 4.18 cluster
2. Use `oc edit` to change `.spec.strategy.type` on the controlplanemachineset `cluster` to `OnDelete`
3. Use `oc delete` to delete all three control plane machines (see the example commands after this list)
4. Observe the cluster
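For completeness, a sketch of the commands for steps 2 and 3. The `oc patch` is equivalent to the `oc edit` in step 2, and the label selector in step 3 is an assumption; listing the three machine names explicitly works just as well.

```sh
# Step 2: switch the control plane machine set strategy to OnDelete.
oc patch controlplanemachineset cluster -n openshift-machine-api \
  --type merge -p '{"spec":{"strategy":{"type":"OnDelete"}}}'

# Step 3: delete all three control plane machines at once. --wait=false
# returns immediately; the machines stay around until their lifecycle
# hooks are removed.
oc delete machines.machine.openshift.io -n openshift-machine-api \
  -l machine.openshift.io/cluster-api-machine-role=master --wait=false
```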
Actual results:
The cluster ends up in a degraded state
Expected results:
The cluster should never become degraded and the rollout should complete without manual intervention
Additional info: