-
Bug
-
Resolution: Unresolved
-
Major
-
None
-
4.18.0, 4.19.0
-
None
-
None
-
Rejected
-
False
-
During cluster bootstrap, disruption can occur when a kube-apiserver instance has no access to any live etcd endpoint. This happens in one very specific scenario:
- kube-apiserver is running on a node and is at revision 1. Its etcd-servers list contains only the bootstrap node IP and localhost.
- when the bootstrap node is deleted, the etcd instance that was running on it becomes unavailable
- when the etcd instance running on the same node as the kube-apiserver instance above is rolled out to a new revision, it also becomes temporarily unavailable
When both of these events happen while a kube-apiserver instance is still on revision 1, neither endpoint in its etcd-servers list is reachable and its readyz probe fails.
The suggested fix is to add a check in cluster-bootstrap that makes sure every kube-apiserver pod has at least 2 etcd-servers that are neither the bootstrap node nor localhost before the bootstrap node is removed.
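A minimal Go sketch of what such a check could look like. This is not the cluster-bootstrap implementation: the function names (`countSafeEtcdServers`, `safeToRemoveBootstrap`), the pod names, and the IP addresses are hypothetical, and it assumes the etcd-servers list of each kube-apiserver pod has already been extracted as a slice of endpoint URLs.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// isLocalhost reports whether host is the name "localhost" or a loopback IP.
func isLocalhost(host string) bool {
	if host == "localhost" {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// countSafeEtcdServers (hypothetical helper) counts the endpoints that are
// neither the bootstrap node nor localhost.
func countSafeEtcdServers(etcdServers []string, bootstrapIP string) int {
	bootstrap := net.ParseIP(bootstrapIP)
	safe := 0
	for _, raw := range etcdServers {
		u, err := url.Parse(raw)
		if err != nil || u.Hostname() == "" {
			continue // unparseable entries are never counted as safe
		}
		host := u.Hostname()
		if isLocalhost(host) || host == bootstrapIP {
			continue
		}
		if ip := net.ParseIP(host); ip != nil && bootstrap != nil && ip.Equal(bootstrap) {
			continue
		}
		safe++
	}
	return safe
}

// safeToRemoveBootstrap (hypothetical helper) returns true only if every
// kube-apiserver pod has at least minSafe usable etcd endpoints.
func safeToRemoveBootstrap(serversByPod map[string][]string, bootstrapIP string, minSafe int) bool {
	for _, servers := range serversByPod {
		if countSafeEtcdServers(servers, bootstrapIP) < minSafe {
			return false
		}
	}
	return true
}

func main() {
	bootstrapIP := "10.0.0.5" // assumed address, for illustration only
	serversByPod := map[string][]string{
		// Pod still on revision 1: only the bootstrap node and localhost.
		"kube-apiserver-master-0": {"https://10.0.0.5:2379", "https://localhost:2379"},
		// Pod on a later revision: all control-plane members plus localhost.
		"kube-apiserver-master-1": {
			"https://10.0.0.10:2379", "https://10.0.0.11:2379",
			"https://10.0.0.12:2379", "https://localhost:2379",
		},
	}
	fmt.Println("safe to remove bootstrap node:", safeToRemoveBootstrap(serversByPod, bootstrapIP, 2))
}
```

With the sample data, the pod still on revision 1 has no endpoint other than the bootstrap node and localhost, so the check refuses to remove the bootstrap node until that pod has rolled out to a revision with a fuller etcd-servers list.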
Job where this is happening: https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1387/pull-ci-openshift-cluster-etcd-operator-master-e2e-aws-ovn-serial/1880358740390055936