Type: Epic
Resolution: Done
Priority: Critical
Fix Version/s: 4.18
Affects Version/s: None
Labels:
None

Epic Name:
Disaster Recovery Automation
Epic Status:
To Do
Activity Type:
Product / Portfolio Work
Parent Link:
OCPSTRAT-539 Enhance recovery procedure for full control plane failure
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Blocked:
False
Blocked Reason:

None

Ready:
False
Color Status:
Not Selected
Size:
None

Target Version:
None
Release Blocker:
None

Epic Goal*

Improve the disaster recovery experience by providing automation for the steps to recover from an etcd quorum loss scenario.

Determining the exact format of the automation (bash script, Ansible playbook, CLI) is part of this epic. Ideally it would be something the admin can initiate on the recovery host, which then walks through the disaster recovery steps once it is given the necessary inputs (e.g. backup and static pod files, SSH access to the recovery and non-recovery hosts).
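
As one illustration of the "necessary inputs" part, whatever the delivery vehicle ends up being, the automation could check its inputs before touching the cluster. The following is a minimal sketch, not a proposed implementation: the names `PreflightResult`, `REQUIRED_PATTERNS`, and `check_backup_dir` are hypothetical, and the file name patterns are an assumption about how the backup files are named.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class PreflightResult:
    """Outcome of checking the inputs before any recovery step runs."""
    missing: list[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        # The inputs are usable only when nothing was reported missing.
        return not self.missing


# Assumed file name patterns; the epic only says "backup and static pod files".
REQUIRED_PATTERNS = {
    "etcd snapshot": "snapshot_*.db",
    "static pod resources": "static_kuberesources_*.tar.gz",
}


def check_backup_dir(backup_dir: Path) -> PreflightResult:
    """Report which required backup files are absent from backup_dir."""
    result = PreflightResult()
    if not backup_dir.is_dir():
        result.missing.append(f"backup directory {backup_dir}")
        return result
    for label, pattern in REQUIRED_PATTERNS.items():
        # glob() is lazy, so any() stops at the first matching file.
        if not any(backup_dir.glob(pattern)):
            result.missing.append(f"{label} ({pattern})")
    return result
```

Failing early with a list of what is missing keeps the admin from starting a recovery that cannot finish. A similar check for SSH reachability of the non-recovery hosts would fit the same pattern but is left out here.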

Why is this important? (mandatory)

The currently documented disaster recovery workflow has a large number of manual steps, which customers and support staff have described as too cumbersome and error-prone:
https://docs.openshift.com/container-platform/4.15/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html

Providing more automation would improve that experience and also let the etcd team better support and test the disaster recovery workflow.

Scenarios (mandatory)

Provide details for user scenarios including actions to be performed, platform specifications, and user personas.

(TBD based on the delivery vehicle for the automation):

As a cluster admin in a DR scenario, I can trigger the quorum recovery procedure (e.g. via a CLI command on a recovery host) to reestablish quorum and recover a stable control plane with API availability.

Dependencies (internal and external) (mandatory)

What items must be delivered by other teams/groups to enable delivery of this epic.

Contributing Teams (and contacts) (mandatory)

Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.

Development - etcd team
Documentation - etcd docs
QE -
PX -
Others -

Acceptance Criteria (optional)

Provide some (testable) examples of how we will know if we have achieved the epic goal.

Drawbacks or Risk (optional)

Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.

Done - Checklist (mandatory)

The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.