Bug
Resolution: Done
4.13, 4.12, 4.11, 4.10, 4.9, 4.8
OSDOCS Sprint 247
Release Note Not Required
Coming from https://issues.redhat.com/browse/OCPBUGS-19740
In the above documentation, around step 1/e -> step 2, there is an IMPORTANT box: "After you remove the member, the cluster might be unreachable for a short time while the remaining etcd instances reboot."
In the past, this happened directly after member removal. Since the introduction of the quorum guard in https://bugzilla.redhat.com/show_bug.cgi?id=2061062, it only happens after step 2 ("Turn off the quorum guard by entering the following command").
We should clarify that it is step 2 that causes API downtime. This is hinted at by "roll out static pods", but we should simply move the IMPORTANT box one step later and rephrase it:
After you turn off the quorum guard, the cluster might be unreachable for a short time while the remaining etcd instances reboot to reflect the changed configuration.
Bonus points: we could also add a few words on why the quorum guard is important:
When etcd is running with two members, it cannot tolerate any additional member failure. Restarting either of the two remaining members breaks quorum and causes downtime in your cluster.
Quorum guard protects etcd from restarts due to configuration changes that could cause such downtime, which is why this has to be explicitly turned off.
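To make the two-member case concrete for whoever writes the doc text, here is a minimal sketch of the majority arithmetic behind it. It assumes the usual majority rule (quorum is more than half of the voting members); the function names are made up for illustration and are not part of etcd or any OpenShift tooling.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """Number of members that can be lost while a majority remains."""
    return members - quorum(members)


# Compare a healthy three-member cluster with the two-member state
# that exists after one member has been removed.
for n in (3, 2):
    print(f"{n} members: quorum={quorum(n)}, "
          f"tolerated failures={failure_tolerance(n)}")
```

With three members one failure can be tolerated; with two members the tolerance drops to zero, so restarting either remaining member loses quorum.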
Version-Release number of selected component (if applicable):
4.8 - 4.15
- is caused by: OCPBUGS-19740 [RHOCP 4.13] Cluster becomes inaccessible when ETCD member is replaced (Closed)