- Feature
- Resolution: Done
- Critical
- BU Product Work
- 50% To Do, 0% In Progress, 50% Done
- XL
- Program Call
Goal
Automate recovery from an etcd quorum loss scenario in OpenShift, replacing the manual steps currently required and improving the user experience.
Why is this important
- Reduced Complexity: Eliminates the need to follow a manual recovery process.
- Improved Efficiency: Saves time and minimizes human error during disaster recovery situations.
- Enhanced Support: Makes it easier for the etcd team to support and test disaster recovery workflows.
- OpenShift Virtualization Topologies: All of these points matter for OpenShift Virtualization topologies with 2+2 control plane nodes across two sites, which aim to provide high availability and fast recovery.
Initiative: Improve etcd disaster recovery experience (part3)
OCPBU-252 and OCPBU-254 lay the foundation for an improved recovery procedure in the case of full control plane loss. This requires researching total control plane failure scenarios for clusters deployed with the various deployment methods.
Scope of this feature:
- Spike to research whether restoring a full control plane with the same properties as the original control plane allows workers to be re-imported, and to document workload behavior
- Document the procedure for recovering from full control plane failure by using a compact cluster to restore the control plane and then re-attaching the workers
- Enhance e2e testing to validate the updated manual procedure under this feature
- blocks
  - OCPSTRAT-215 [internal] Automated restore of etcd from external target (design); status: New
- depends on
  - OCPSTRAT-1395 Automated control-plane recovery from expired certificates (hibernation); status: Release Pending
- is blocked by
  - OCPSTRAT-464 Automated backups of etcd (external targets); status: Backlog
- is related to
  - OCPSTRAT-1199 4 and 5-nodes control-plane architecture for bare-metal spanned clusters; status: Closed
- relates to
  - API-1376 OpenShift 4.X supports an official process to shut down, restart, and resume an OpenShift cluster from a powered off state, this function should be continuously validated, supported, and guaranteed for consumers for DR and lifecycle use-cases; status: New
  - OCPSTRAT-529 Improve disaster recovery test coverage for etcd; status: Closed
- links to