-
Feature
-
Resolution: Done
-
Critical
-
None
BU Priority Overview
Initiative: Improve etcd disaster recovery experience (part1)
Goals
The current etcd backup and recovery process is described in our docs: https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html
The current process leaves it up to the cluster-admin to figure out how to take consistent backups following the documented procedure.
This feature is part of a progressive delivery to improve the cluster-admin experience for backing up etcd and restoring etcd clusters to a healthy state.
Scope of this feature:
- etcd quorum loss (2-node failure) on a 3-node OCP control plane
- etcd degradation (1-node failure) on a 3-node OCP control plane
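The two scenarios in scope differ in whether etcd keeps quorum. etcd uses Raft, which needs a majority of voting members, so a 3-member cluster tolerates one failure but not two. A minimal Python sketch of that arithmetic follows; the function names are illustrative and not part of any etcd or OpenShift API.

```python
def quorum_size(members: int) -> int:
    """Votes needed for a majority among `members` voting members."""
    return members // 2 + 1


def has_quorum(members: int, failed: int) -> bool:
    """True if the surviving members still form a majority."""
    return members - failed >= quorum_size(members)


# The two failure scenarios in scope, on a 3-member control plane.
for failed in (1, 2):
    state = "degraded, quorum kept" if has_quorum(3, failed) else "quorum lost"
    print(f"3 members, {failed} failed: {state}")
```

With one member down the cluster is degraded but can still serve writes; with two down it cannot, which is why that case needs a restore from backup.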
Execution Plans
- Improve etcd disaster recovery e2e test coverage
- Design an automated backup API. The initial target is a local destination.
- Should provide a way (e.g. a script or tool) for the cluster-admin to validate that backup files remain valid over time (e.g. to account for disk failures corrupting the backup)
- Should document updated manual steps to restore from a local backup. These steps should be part of the e2e test coverage.
- Should document manual steps to copy backup files to a destination outside the cluster (e.g. an ssh copy a cluster-admin can use in a CronJob)
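One possible shape for the backup validation tool mentioned above is a checksum manifest: record a SHA-256 digest for each backup file when the backup is taken, then recompute and compare on a schedule. The sketch below is an assumption about how such a check could work, not the design this feature adopted; the function names and the JSON manifest format are hypothetical. It detects files that changed or disappeared after the backup was taken; it does not prove that the snapshot itself is restorable.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large snapshots are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(backup_dir: Path, manifest: Path) -> dict:
    """Record a digest for every regular file in backup_dir (hypothetical format).

    A file with the same name as the manifest is skipped, so the manifest
    may live inside backup_dir.
    """
    sums = {
        p.name: sha256_of(p)
        for p in sorted(backup_dir.iterdir())
        if p.is_file() and p.name != manifest.name
    }
    manifest.write_text(json.dumps(sums, indent=2))
    return sums


def verify_manifest(backup_dir: Path, manifest: Path) -> list:
    """Return the sorted names of files that are missing or whose digest changed."""
    expected = json.loads(manifest.read_text())
    bad = []
    for name, digest in expected.items():
        path = backup_dir / name
        if not path.is_file() or sha256_of(path) != digest:
            bad.append(name)
    return sorted(bad)
```

A cluster-admin could call `write_manifest` right after each backup and run `verify_manifest` periodically, treating a non-empty result as a signal to take a fresh backup. Copying the manifest off the node along with the backup files also lets the same check confirm that a remote copy arrived intact.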
- blocks
  - OCPSTRAT-464 Automated backups of etcd (external targets) (Backlog)
- is related to
  - OCPSTRAT-529 Improve disaster recovery test coverage for etcd (Closed)
- relates to
  - API-1376 OpenShift 4.X supports an official process to shut down, restart, and resume an OpenShift cluster from a powered off state, this function should be continuously validated, supported, and guaranteed for consumers for DR and lifecycle use-cases (New)
  - ACM-1699 ACM Better Integration of ETCD-backup-Policy (Closed)
- links to