OpenShift Bugs / OCPBUGS-34754

OCP 4.x - etcd restore docs mandate a complete rebuild of non-recovery master nodes, but this is not necessary.


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Undefined
    • Affects Version/s: 4.13, 4.12, 4.14.0, 4.15, 4.16
    • Component/s: Documentation / etcd
    • Severity: Important

      Description of problem:

      Our documentation for both disaster recovery ("Restoring to a previous cluster state") and "Replacing an unhealthy etcd member" presents the following as a required step:
      
      - remove the faulty node from the cluster entirely, rebuild it (after removing all networking components), and deploy a fresh host.
      
      - The same step appears in the single-node recovery process for etcd backup and restore, where we advise that the nodes that are NOT the recovery host be deleted from the cluster and rebuilt from scratch before pushing a revision update.
      
      HOWEVER, this is not a necessary step if the nodes are otherwise healthy.
      
      - It is possible to instead proceed with the following steps after the etcd restore script has been run and a single etcd instance is online on the recovery host (a quick check for this is sketched below):
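      
      Before continuing, it can help to confirm that exactly one restored etcd instance is running on the recovery host. A minimal sketch, assuming shell access to the recovery host (for example via SSH or `oc debug node/<recovery-host>`); the egrep filter is only illustrative:
      
      ~~~
      # Run on the recovery host; expect a single running etcd container.
      # The filter excludes the etcd operator and etcd-guard containers.
      $ sudo crictl ps | grep etcd | egrep -v "operator|etcd-guard"
      ~~~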
      
      
      1. disable quorum guard:
      
      ~~~
      $ oc patch etcd/cluster --type=merge -p '{"spec": {"unsupportedConfigOverrides": {"useUnsupportedUnsafeNonHANonProductionUnstableEtcd": true}}}'
      ~~~
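      
      Optionally, verify that the override was applied before moving on. A minimal sketch; the jsonpath expression is just one way to read the field back:
      
      ~~~
      # Should print the unsupportedConfigOverrides block with
      # useUnsupportedUnsafeNonHANonProductionUnstableEtcd set to true.
      $ oc get etcd/cluster -o jsonpath='{.spec.unsupportedConfigOverrides}{"\n"}'
      ~~~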
          
      
      2. list and remove the etcd secrets for the control plane nodes that are not the etcd recovery host:
      
      ~~~
      $ oc get secrets -n openshift-etcd | grep -E "openshift-control-plane-2|openshift-control-plane-3"
      
      $ oc delete secret -n openshift-etcd <secret1> <secret2> <secret3> <secret4> <secret5> <secret6>
      ~~~
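      
      Each non-recovery control plane node normally has three secrets in openshift-etcd (peer, serving, and serving-metrics), so the deletion can also be scripted. A minimal sketch, reusing the placeholder node names from the example above:
      
      ~~~
      # Delete every openshift-etcd secret whose name contains a non-recovery
      # control plane node name (placeholder node names).
      $ for node in openshift-control-plane-2 openshift-control-plane-3; do
          oc get secrets -n openshift-etcd -o name | grep "$node" | xargs -r oc delete -n openshift-etcd
        done
      ~~~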
      
      3. force an etcd redeployment:
      
      ~~~
      $ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge 
      ~~~
      
      4. observe the new members come online: each non-recovery host should reach 4/4 pod status, the recovery host should go from a 1/1 pod to a 4/4 pod, and the revision will then roll over to the latest build:
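      
      One way to watch this, using the same label selector as the status check in step 9 (a sketch; Ctrl-C once everything is 4/4 Running):
      
      ~~~
      # Watch the etcd pods until every control plane node reports 4/4 Running.
      $ oc -n openshift-etcd get pods -l k8s-app=etcd -w
      ~~~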
      
      5. check that etcd revision is unified:
      
      ~~~
      $ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
      ~~~
      
      * If the revision is not yet unified, wait longer for the rollout to complete.
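      
      If you prefer to wait on this programmatically, something like the following can poll until the reason reaches the documented AllNodesAtLatestRevision state (a sketch; adjust the interval to taste):
      
      ~~~
      # Poll the NodeInstallerProgressing condition until all nodes are at the
      # latest etcd revision.
      $ until oc get etcd -o=jsonpath='{.items[0].status.conditions[?(@.type=="NodeInstallerProgressing")].reason}' | grep -q AllNodesAtLatestRevision; do
          sleep 30
        done
      ~~~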
      
      6. proceed to roll out a new revision of the kube-apiserver:
      
      ~~~
      $ oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
      
      verify:
      
      $ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
      ~~~
      
      7. proceed to roll out a new revision of the kube-controller-manager:
      
      ~~~
      $ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
      
      verify:
      
      $ oc get kubecontrollermanager -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
      ~~~
      
      8. proceed to roll out a new revision of the kube-scheduler:
      
      ~~~
      $ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
      
      verify:
      
      $ oc get kubescheduler -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'
      ~~~
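      
      Since steps 6-8 follow the same pattern for three operators, they can also be driven from a single loop. A minimal sketch, equivalent to running the three patch commands above one after another (each rollout can still be verified with the per-operator jsonpath checks):
      
      ~~~
      # Force a redeployment of kube-apiserver, kube-controller-manager, and
      # kube-scheduler with a fresh recovery reason.
      $ for kind in kubeapiserver kubecontrollermanager kubescheduler; do
          oc patch "$kind" cluster --type=merge -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}'
        done
      ~~~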
      
      9. Confirm cluster status:
      
      ~~~
      $ oc -n openshift-etcd get pods -l k8s-app=etcd
      $ oc get node
      $ oc get co
      ~~~
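      
      As a final check, one way to flag any cluster operator that is not Available, or is still Progressing or Degraded, is shown below (a sketch that relies on the default `oc get co` column order of NAME, VERSION, AVAILABLE, PROGRESSING, DEGRADED):
      
      ~~~
      # Print only operators that are not Available=True / Progressing=False / Degraded=False.
      $ oc get co --no-headers | awk '$3 != "True" || $4 != "False" || $5 != "False"'
      ~~~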

      Version-Release number of selected component (if applicable):

       

          4.12, 4.13, 4.14, 4.15

      How reproducible:

      every time - rebuilding nodes is no longer a required step unless the node is ACTUALLY unhealthy and must be recreated. If the goal is only to restore from backup, and/or the host can be recovered and re-integrated with the cluster once its etcd member is removed, then rebuilding it from an ISO is not a necessary part of the re-integration procedure.

      Steps to Reproduce:

          1.
          2.
          3.
          

      Actual results:

          The cluster can be restored without rebuilding nodes.

      Expected results:

      The docs are misleading in that they scope the rebuilding of a cluster node as a REQUIREMENT; I would argue that this is legacy information that may have been necessary on earlier versions of OpenShift, and that a rebuild should only be required when the node itself is unhealthy.

      Additional info:

        This can be replicated on any test cluster easily, but if you would like a demonstration or additional context, please let me know and I will happily provide details.

              Assignee: ocp-docs-bot (OCP DocsBot)
              Reporter: rhn-support-wrussell (Will Russell)
              QA Contact: Ge Liu