-
Feature
-
Resolution: Done
-
Major
-
BU Product Work
-
0% To Do, 0% In Progress, 100% Done
-
Program Call
Goal
Note: This is an internal improvement. There are no user-facing deliverables.
There are a few areas to cover for Disaster Recovery (DR):
- Finish rewriting the existing DR Bash scripts in Go
- Add guardrails to the code that prevent the customer from causing additional damage to the cluster during disaster recovery.
- Clean up technical debt in the MCO repo and the installer.
Why is this important?
When a cluster has an event that results in, for example, quorum loss, the situation is very stressful for the admin. Providing a clean recovery path for that event, backed by well-thought-out tools, makes the recovery far less painful.
It also helps us avoid customer situations like this one:
https://docs.google.com/document/d/1ULGQARWdxjujWpSyncY0pKrUG9OcT0PlhEmYVwrPEAE/edit?ts=5eb18ea3
Scenarios
- A customer has a cluster event that causes loss of quorum.
- incorporates
  - RFE-1649 Test a supported way to move /var/lib/etcd to a new disk as day 2 task (Rejected)
- is related to
  - RFE-1287 Provide ability to rollback OpenShift cluster to previous release (Suggest: Automated Etcd Backups/Restores) (Rejected)
  - OCPSTRAT-215 [internal] Automated restore of etcd from external target (design) (New)
  - OCPSTRAT-539 Enhance recovery procedure for full control plane failure (In Progress)
- relates to
  - API-1376 OpenShift 4.X supports an official process to shut down, restart, and resume an OpenShift cluster from a powered off state, this function should be continuously validated, supported, and guaranteed for consumers for DR and lifecycle use-cases (New)
  - RFE-3634 Add option --skip-hash-check=true to the ETCD recovery pod (Accepted)
  - OCPSTRAT-464 Automated backups of etcd (external targets) (Backlog)
  - OCPSTRAT-403 Automated backups of etcd (local destination) (Closed)
- links to