Type: Bug
Resolution: Unresolved
Priority: Minor
Fix Version/s: None
Affects Version/s: 4.12, 4.13, 4.14, 4.15, 4.16, 4.17
Component/s: Documentation / etcd
Labels:
- triaged

Regression: No
Blocked: False
Blocked Reason: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

Many customers (and internal users) are trying to test the restore procedure documented at:
https://docs.openshift.com/container-platform/4.14/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html


Unfortunately, we spend a lot of support time on these tests because the OCP cluster was not put into a state broken enough for our recovery docs to apply. Examples: the etcd static pods are not moved out, the machines are not shut down, the data dir is still there, kubelet is still running, or the nodes are still Ready.

We should add a section on how to put a working cluster into such a state, so that the recovery procedure can be tested on it.

Say you choose master-0 as the recovery node. Then the non-recovery nodes, master-1 and master-2, need to be "broken".

On each of those nodes, we expect the customer to run the following over SSH:

> sudo rm -rf /etc/kubernetes/manifests/etcd-pod.yaml
> sudo rm -rf /var/lib/etcd
> sudo systemctl disable kubelet.service
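
For repeatable tests, the three commands above could be wrapped in a small helper. The following is a minimal sketch, not a supported procedure: the function names `run_on` and `break_node`, the `DRY_RUN` variable, and the `core` SSH user are assumptions made for illustration. By default it only prints what it would run.

```shell
# Sketch only. Assumptions (not from this issue): SSH user "core",
# the DRY_RUN variable, and the helper names run_on / break_node.

# run_on HOST CMD...: print the ssh invocation (default), or execute it
# when DRY_RUN=0 is set.
run_on() {
  host="$1"; shift
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf 'ssh core@%s %s\n' "$host" "$*"
  else
    ssh "core@${host}" "$*"
  fi
}

# break_node HOST: apply the three commands from this issue to one
# non-recovery node, stopping at the first failure.
break_node() {
  run_on "$1" sudo rm -rf /etc/kubernetes/manifests/etcd-pod.yaml &&
  run_on "$1" sudo rm -rf /var/lib/etcd &&
  run_on "$1" sudo systemctl disable kubelet.service
}
```

Running `break_node master-1` prints the three `ssh` invocations for review; `DRY_RUN=0 break_node master-1` would execute them. Note that `systemctl disable` on its own does not stop a unit that is already running, so a reboot or `systemctl stop` may also be needed; this issue does not say which.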

These commands effectively delete etcd on the node and ensure that the node turns NotReady. One can verify this with:

> oc get pods -n openshift-etcd

which should no longer show etcd pods on those nodes, and

> oc get nodes

which should display the non-recovery nodes as NotReady. Note that once the second non-recovery node has been broken, etcd has lost quorum, so the API is no longer available and these commands are expected to fail.
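
The node check could also be scripted. The following is a minimal sketch under stated assumptions: the helper name `nodes_not_ready` is hypothetical, and it assumes the usual `oc get nodes --no-headers` layout with the node name in the first column and the status in the second.

```shell
# Sketch only. nodes_not_ready is a hypothetical helper name.
# Reads `oc get nodes --no-headers` output on stdin and succeeds only if
# every node named as an argument has a STATUS beginning with "NotReady".
# A node that does not appear in the input counts as a failure.
nodes_not_ready() {
  awk -v want="$*" '
    BEGIN { n = split(want, names, " ") }
    { status[$1] = $2 }          # column 1: NAME, column 2: STATUS
    END {
      for (i = 1; i <= n; i++)
        if (status[names[i]] !~ /^NotReady/) exit 1
    }
  '
}
```

For example, after breaking only the first non-recovery node, `oc get nodes --no-headers | nodes_not_ready master-1` should succeed. Once the API is down, `oc` returns no node list and the check fails, which matches the expectation stated above.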

Then, one can attempt the restore procedure in the documentation linked above, using master-0 as the recovery node.

Version-Release number of selected component (if applicable):

any supported OCP release

How reproducible:

always

Steps to Reproduce:

NA

Actual results:

NA

Expected results:

NA

Additional info:

This came out of a Slack conversation: https://redhat-internal.slack.com/archives/C027U68LP/p1717444121935539

Assignee: Laura Hinson

Reporter: Thomas Jungblut

QA Contact: Ge Liu

Votes: 0

Watchers: 2

Created: 2024/06/26 3:27 PM

Updated: 2024/08/14 1:58 PM
