- Bug
- Resolution: Unresolved
- Minor
- None
- 4.13, 4.12, 4.14, 4.15, 4.16, 4.17
- No
- False
Description of problem:
Many customers (and internal teams) try to test the restore procedure in https://docs.openshift.com/container-platform/4.14/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html

Unfortunately, we spend a lot of support time because the OCP cluster was not put into a broken enough state for our recovery docs to apply. Examples: the etcd static pods are not moved out, the machines are not shut down, the data dir is still there, the kubelet is still running, and the nodes are still READY.

We should add a section on how to put a working cluster into such a state in order to test this recovery procedure. Say you choose master-0 as the recovery node; we then need to "break" the non-recovery nodes master-1 and master-2. On those nodes, we expect the customer to execute via ssh:

> sudo rm -rf /etc/kubernetes/manifests/etcd-pod.yaml
> sudo rm -rf /var/lib/etcd
> sudo systemctl disable kubelet.service

This effectively deletes etcd and ensures that the node turns NOT READY. One can verify this with:

> oc get pods -n openshift-etcd

which should no longer show etcd pods on those nodes, and

> oc get nodes

which should display the non-recovery nodes as NOT READY. Note that after the second non-recovery node has been broken, the API is no longer available and those commands are expected to fail. Then one can attempt the restore procedure in the documentation above, using master-0 as the recovery node.
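For reference, a minimal sketch of how the "break" step could be scripted from a workstation with ssh access to the control-plane nodes. The node names master-1 and master-2 come from the example above; the "core" ssh user and direct ssh reachability of the nodes are assumptions and may differ per environment:

#!/bin/bash
# Sketch: break the two non-recovery control-plane nodes so that the
# documented disaster-recovery procedure becomes applicable.
# Assumption: ssh as the "core" user works and the node names below
# match your cluster; adjust both as needed.

NON_RECOVERY_NODES="master-1 master-2"

for node in ${NON_RECOVERY_NODES}; do
  echo "Breaking etcd on ${node}..."
  ssh "core@${node}" '
    sudo rm -rf /etc/kubernetes/manifests/etcd-pod.yaml
    sudo rm -rf /var/lib/etcd
    sudo systemctl disable kubelet.service
  '
done

# Verification; these commands are expected to fail entirely once the
# second non-recovery node is broken and the API loses quorum.
oc get pods -n openshift-etcd || true   # etcd pods should be gone from the broken nodes
oc get nodes || true                    # broken nodes should report NOT READY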
Version-Release number of selected component (if applicable):
any supported OCP release
How reproducible:
always
Steps to Reproduce:
NA
Actual results:
NA
Expected results:
NA
Additional info:
This came out of a Slack conversation: https://redhat-internal.slack.com/archives/C027U68LP/p1717444121935539