OpenShift Bugs / OCPBUGS-19883

Move reboot warning in "Replacing an unhealthy etcd member"


    • Type: Bug
    • Resolution: Done
    • Priority: Undefined
    • 4.13, 4.12, 4.11, 4.10, 4.9, 4.8
    • Documentation / etcd
    • OSDOCS Sprint 247
    • Release Note Not Required

      Coming from https://issues.redhat.com/browse/OCPBUGS-19740

      Doc Link:
      https://docs.openshift.com/container-platform/4.13/backup_and_restore/control_plane_backup_and_restore/replacing-unhealthy-etcd-member.html

      In the documentation above, around the transition from step 1.e to step 2, there is an IMPORTANT box: "After you remove the member, the cluster might be unreachable for a short time while the remaining etcd instances reboot."

      In the past, this happened directly after member removal. Since the introduction of the quorum guard in https://bugzilla.redhat.com/show_bug.cgi?id=2061062 it only happens after step 2 ("Turn off the quorum guard by entering the following command").

      We should clarify that it is step 2 that causes API downtime. This is loosely implied by "roll out static pods", but we should simply move the IMPORTANT box one step further and rephrase it:

      After you turn off the quorum guard, the cluster might be unreachable for a short time while the remaining etcd instances reboot to reflect the changed configuration.


      Bonus points: maybe we should also add a few words on why the quorum guard is important:

      When etcd is running with two members, it cannot tolerate any additional member failure. Restarting either of the two remaining members breaks quorum and causes downtime in your cluster.

      The quorum guard protects etcd from restarts due to configuration changes that could cause such downtime, which is why it has to be turned off explicitly.
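
      The quorum argument above can be checked with a little arithmetic. The following is only an illustrative sketch, assuming the usual majority rule (quorum is more than half of the voting members); the function names are hypothetical and are not part of etcd or OpenShift.

```python
def quorum_size(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    return members // 2 + 1


def failures_tolerated(members: int, healthy: int) -> int:
    """How many more healthy members can stop before quorum is lost."""
    return max(healthy - quorum_size(members), 0)


def survives_restart(members: int, healthy: int) -> bool:
    """True if one healthy member can restart without losing quorum."""
    return healthy - 1 >= quorum_size(members)


if __name__ == "__main__":
    # (voting members, healthy members): a full three-member cluster,
    # three members with one unhealthy, and two members after removal.
    for members, healthy in [(3, 3), (3, 2), (2, 2)]:
        print(
            f"members={members} healthy={healthy} "
            f"quorum={quorum_size(members)} "
            f"tolerated={failures_tolerated(members, healthy)} "
            f"survives_restart={survives_restart(members, healthy)}"
        )
```

      Under this assumption, a healthy three-member cluster tolerates one failure, while a cluster with only two healthy members tolerates none, whether or not the unhealthy member has already been removed. That is the situation in which a restart of either remaining member makes the API unreachable.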

      Version-Release number of selected component (if applicable):

      4.8 - 4.15

            kowen@redhat.com Kevin Owen
            tjungblu@redhat.com Thomas Jungblut
            ge liu ge liu