Epic
Resolution: Unresolved
Critical
Disaster Recovery Automation
BU Product Work
To Do
OCPSTRAT-539 - Enhance recovery procedure for full control plane failure
11% To Do, 16% In Progress, 74% Done
Epic Goal*
Improve the disaster recovery experience by providing automation for the steps to recover from an etcd quorum loss scenario.
Determining the exact format of the automation (bash script, Ansible playbook, or CLI) is part of this epic. Ideally, it would be something the admin can initiate on the recovery host that then walks through the disaster recovery steps, given the necessary inputs (e.g., backup and static pod files, SSH access to the recovery and non-recovery hosts).
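To make the intended shape of the automation more concrete, here is a minimal dry-run sketch in Python. It is an illustration only, not the chosen delivery vehicle: the class, field, and function names are hypothetical, the example paths and host names are placeholders, and the three steps are a deliberate simplification of the documented procedure. The sketch only validates the inputs and prints an ordered plan; it does not execute anything on any host.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RecoveryInputs:
    """Inputs the epic lists as prerequisites (all field names are hypothetical)."""
    backup_dir: str                # directory holding the etcd backup and static pod files
    recovery_host: str             # control plane host where recovery is initiated
    non_recovery_hosts: List[str]  # remaining control plane hosts, reachable over SSH


def validate(inputs: RecoveryInputs) -> List[str]:
    """Return human-readable problems; an empty list means the inputs look usable."""
    problems = []
    if not inputs.backup_dir:
        problems.append("backup directory is not set")
    if not inputs.recovery_host:
        problems.append("recovery host is not set")
    if inputs.recovery_host in inputs.non_recovery_hosts:
        problems.append("recovery host must not also be listed as a non-recovery host")
    return problems


def build_plan(inputs: RecoveryInputs) -> List[str]:
    """Build an ordered, dry-run description of the steps; nothing is executed."""
    problems = validate(inputs)
    if problems:
        raise ValueError("; ".join(problems))
    # Simplified ordering (an assumption): quiesce the other hosts first,
    # then restore on the recovery host, then wait for the API to return.
    plan = [f"[{host}] stop control plane static pods" for host in inputs.non_recovery_hosts]
    plan.append(f"[{inputs.recovery_host}] restore etcd from {inputs.backup_dir}")
    plan.append(f"[{inputs.recovery_host}] wait for etcd and the API server to become available")
    return plan


if __name__ == "__main__":
    # Placeholder values for illustration only.
    demo = RecoveryInputs("/home/core/assets/backup", "master-0", ["master-1", "master-2"])
    for step in build_plan(demo):
        print(step)
```

Validating inputs before building the plan reflects the goal of reducing error-prone manual steps: a missing backup path or a mislabeled host is reported before any recovery action would be attempted.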
Why is this important? (mandatory)
The currently documented disaster recovery workflow contains a large number of manual steps, which customers and support staff have described as too cumbersome and error-prone.
https://docs.openshift.com/container-platform/4.15/backup_and_restore/control_plane_backup_and_restore/disaster_recovery/scenario-2-restoring-cluster-state.html
Providing more automation would improve that experience and also let the etcd team better support and test the disaster recovery workflow.
Scenarios (mandatory)
Provide details for user scenarios including actions to be performed, platform specifications, and user personas.
(TBD based on the delivery vehicle for the automation):
- As a cluster admin in a DR scenario, I can trigger the quorum recovery procedure (e.g., via a CLI command on a recovery host) to reestablish quorum and recover a stable control plane with API availability.
Dependencies (internal and external) (mandatory)
What items must be delivered by other teams/groups to enable delivery of this epic.
Contributing Teams (and contacts) (mandatory)
Our expectation is that teams would modify the list below to fit the epic. Some epics may not need all the default groups but what is included here should accurately reflect who will be involved in delivering the epic.
- Development - etcd team
- Documentation - etcd docs
- QE -
- PX -
- Others -
Acceptance Criteria (optional)
Provide some (testable) examples of how we will know if we have achieved the epic goal.
Drawbacks or Risk (optional)
Reasons we should consider NOT doing this such as: limited audience for the feature, feature will be superseded by other work that is planned, resulting feature will introduce substantial administrative complexity or user confusion, etc.
Done - Checklist (mandatory)
The following points apply to all epics and are what the OpenShift team believes are the minimum set of criteria that epics should meet for us to consider them potentially shippable. We request that epic owners modify this list to reflect the work to be completed in order to produce something that is potentially shippable.
- CI Testing - Basic e2e automation tests are merged and completing successfully
- Documentation - Content development is complete.
- QE - Test scenarios are written and executed successfully.
- Technical Enablement - Slides are complete (if requested by PLM)
- Engineering Stories Merged
- All associated work items with the Epic are closed
- Epic status should be “Release Pending”