OpenShift Bugs / OCPBUGS-36218

Restoring to a previous cluster state needs more testing information


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor
    • None
    • 4.13, 4.12, 4.14, 4.15, 4.16, 4.17
    • Documentation / etcd

      Description of problem:

      Many customers (and internal teams) are trying to test the restore procedure documented at
      https://docs.openshift.com/container-platform/4.14/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html
      
      
      Unfortunately, we spend a lot of support time on these tests because the OCP cluster was not put into a state broken enough for our recovery docs to apply. Typical examples: the etcd static pods were not moved out, the machines were not shut down, the data dir is still there, kubelet is still running, or the nodes are still READY.
      
      We should add a section describing how to put a working cluster into such a state, so that the recovery procedure can be tested on it.
      
      Say you choose master-0 as the recovery node. You then need to "break" the non-recovery nodes, master-1 and master-2.
      
      On each of those nodes, we expect the customer to run the following over SSH:
      
      > sudo rm -rf /etc/kubernetes/manifests/etcd-pod.yaml
      > sudo rm -rf /var/lib/etcd
      > sudo systemctl disable kubelet.service
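
The three commands above could be wrapped in a small helper so that they are applied the same way to every non-recovery node. This is only a sketch under assumptions: the `break_node` function and the `DRY_RUN` variable are hypothetical names, `master-1` and `master-2` stand in for the real host names, and SSH access as the `core` user is assumed. The commands themselves are copied unchanged from the list above.

```shell
# Sketch only: break_node and DRY_RUN are hypothetical, not part of any
# OpenShift tooling. With DRY_RUN=1 (the default) nothing is executed; the
# ssh invocations are only printed so they can be reviewed first.
break_node() {
  host="$1"
  for cmd in \
    "sudo rm -rf /etc/kubernetes/manifests/etcd-pod.yaml" \
    "sudo rm -rf /var/lib/etcd" \
    "sudo systemctl disable kubelet.service"
  do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      printf 'ssh core@%s %s\n' "$host" "$cmd"
    else
      # Assumption: the node is reachable over SSH as the core user.
      ssh "core@${host}" "$cmd"
    fi
  done
}

# master-0 is kept as the recovery node; only the other two are touched.
for node in master-1 master-2; do
  break_node "$node"
done
```

Running the block as written only prints the six `ssh` command lines. Setting `DRY_RUN=0` would run them for real, which is destructive and only appropriate on a cluster set aside for testing. Note also that `systemctl disable` by itself does not stop a unit that is already running; whether an additional stop or reboot is needed is left open in this sketch.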
      
      This effectively deletes etcd and ensures that the node turns into a NOT READY state. You can verify this with:
      
      > oc get pods -n openshift-etcd   # should no longer show etcd pods on those nodes
      
      and 
      
      > oc get nodes
      
      which should display the non-recovery nodes as NOT READY. Note that once the second non-recovery node has been broken, etcd has lost quorum, so the API is no longer available and these commands are expected to fail.
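
The node check could also be scripted rather than read by eye. The following is a minimal sketch, assuming the usual column layout of `oc get nodes --no-headers` (name in the first column, status in the second); the function name `nodes_not_ready` is hypothetical.

```shell
# Sketch only: nodes_not_ready is a hypothetical helper. It reads the output
# of `oc get nodes --no-headers` on stdin and succeeds only if every node
# named in its arguments has a STATUS that starts with NotReady.
nodes_not_ready() {
  awk -v want="$*" '
    BEGIN {
      n = split(want, names, " ")
      for (i = 1; i <= n; i++) pending[names[i]] = 1
    }
    # Tick off each wanted node once it is seen with a NotReady status.
    ($1 in pending) && $2 ~ /^NotReady/ { delete pending[$1] }
    END {
      bad = 0
      for (name in pending) { print name " is not reported as NotReady"; bad = 1 }
      exit bad
    }
  '
}

# Usage against a live cluster (not executed here):
#   oc get nodes --no-headers | nodes_not_ready master-1
```

The check is only meaningful while the API still answers, that is, after the first non-recovery node has been broken. Once the second one is done, `oc get nodes` itself fails, the helper receives no input, and it reports the nodes as unconfirmed, which matches the expected failure described above.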
      
      You can then attempt the restore procedure in the documentation linked above, using master-0 as the recovery node.

      Version-Release number of selected component (if applicable):

      any supported OCP release    

      How reproducible:

      always

      Steps to Reproduce:

      NA
          

      Actual results:

      NA    

      Expected results:

      NA    

      Additional info:

      This came out of a Slack conversation: https://redhat-internal.slack.com/archives/C027U68LP/p1717444121935539
          

       

              rhn-support-lahinson Laura Hinson
              tjungblu@redhat.com Thomas Jungblut
              Ge Liu Ge Liu