This has been reported by lance5890 upstream: https://github.com/openshift/cluster-etcd-operator/issues/1237
Description of problem:
During removal of one master node (out of 3), the etcd cert signer controller might still roll out a revision even though doing so will break quorum. Important events:
08:06:26.674067 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'MasterNodeRemoved' Observed removal of master node node3
08:06:26.909780 1 base_controller.go:272] EtcdEndpointsController reconciliation failed: EtcdEndpointsController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
08:06:27.005308 1 event.go:285] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"253380d4-7d65-496f-8214-ab89f7878550", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretUpdated' Updated Secret/etcd-all-certs -n openshift-etcd because it changed
08:06:27.149860 1 base_controller.go:272] EtcdCertSignerController reconciliation failed: EtcdCertSignerController can't evaluate whether quorum is safe: CheckSafeToScaleCluster 3 nodes are required, but only 2 are available
Version-Release number of selected component (if applicable):
All versions that include the quorum guard (currently > 4.12).
How reproducible:
Depends on the timing of the node removal relative to the controller runs, but it reproduces fairly frequently.
Steps to Reproduce:
1. Remove a master node.
2. Wait for quorum loss / downtime due to the revision rollout.
Actual results:
Quorum is lost and there is brief API downtime while the revision is rolled out.
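The quorum arithmetic behind this result can be sketched as follows. This is an illustration only, not code from the operator: the function names are hypothetical, and it assumes the cluster still counts 3 members while only 2 are healthy, and that a revision rollout restarts one healthy member at a time.

```go
package main

import "fmt"

// quorum returns the minimum number of healthy voting members that an
// etcd cluster of the given size needs in order to keep accepting writes.
func quorum(members int) int {
	return members/2 + 1
}

// survivesRollout (hypothetical helper) reports whether quorum still holds
// while one of the currently healthy members is restarting.
func survivesRollout(members, healthy int) bool {
	return healthy-1 >= quorum(members)
}

func main() {
	// Three members, all healthy: one member can restart safely.
	fmt.Println(survivesRollout(3, 3)) // true
	// Three members, one master already removed: a restart leaves 1 of 3.
	fmt.Println(survivesRollout(3, 2)) // false
}
```

With 2 of 3 members available, any restart triggered by a new revision leaves a single member, which is below the quorum of 2.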
Expected results:
The revisioned secret should not be updated when quorum is about to be lost.
Additional info:
- is duplicated by: ETCD-612 Skip static pod rollouts (Closed)
- links to: RHEA-2024:0041 OpenShift Container Platform 4.16.z bug fix update