Type: Bug
Resolution: Done
Priority: Undefined
Fix Version/s: None
Affects Version/s: 4.13, 4.12, 4.11, 4.10, 4.9, 4.8
Component/s: Documentation / etcd
Labels: None
Regression: No
Story Points: 2
Sprint: OSDOCS Sprint 247
sprint_count: 1
Blocked: False
Blocked Reason: None
Release Note Text: N/A
Release Note Type: Release Note Not Required

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Coming from https://issues.redhat.com/browse/OCPBUGS-19740

Doc Link:
https://docs.openshift.com/container-platform/4.13/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html

In the documentation linked above, between step 1.e and step 2, there is an IMPORTANT box: "After you remove the member, the cluster might be unreachable for a short time while the remaining etcd instances reboot."

In the past, this happened directly after member removal. Since the introduction of the quorum guard in https://bugzilla.redhat.com/show_bug.cgi?id=2061062, it only happens after step 2 ("Turn off the quorum guard by entering the following command").

We should clarify that it is step 2 that causes API downtime. This is hinted at by "roll out static pods", but we should simply move the IMPORTANT box one step further and rephrase it:

After you turn off the quorum guard, the cluster might be unreachable for a short time while the remaining etcd instances reboot to reflect the changed configuration.

Bonus points: maybe we should also add a few words on why the quorum guard is important:

When etcd is running with two members, it cannot tolerate any additional member failure. Restarting either of the two remaining members breaks quorum and causes downtime in your cluster.

The quorum guard protects etcd from restarts due to configuration changes that could cause such downtime, which is why it has to be turned off explicitly.
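
For reviewers who want the arithmetic behind the two-member statement, here is a minimal sketch. It assumes the usual majority rule for quorum (more than half of the voting members). The function names are illustrative only and are not part of etcd or OpenShift.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Number of members that can be lost while a majority stays available."""
    return members - quorum(members)


# Compare a healthy three-member cluster with the two-member state
# that exists after the unhealthy member has been removed.
for n in (3, 2):
    print(f"{n} members: quorum={quorum(n)}, tolerated failures={failures_tolerated(n)}")
```

With three members one failure is tolerated. With two members the quorum is still two, so restarting either remaining member loses quorum, which matches the downtime described above.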

Version-Release number of selected component (if applicable):