Initiative: Improve etcd disaster recovery experience (part1)
The current etcd backup and recovery process is described in our docs: https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html
The current process leaves it up to the cluster-admin to figure out how to take consistent backups by following the documented procedure.
This feature is part of a progressive delivery to improve the cluster-admin experience for backup and restore of etcd clusters to a healthy state.
- etcd quorum loss (2-node failure) on a 3-node OCP control plane
- etcd degradation (1-node failure) on a 3-node OCP control plane
- Improve etcd disaster recovery e2e test coverage
- Design an automated backup API. The initial target is a local destination.
- Should provide a way (e.g. script or tool) for the cluster-admin to validate that backup files remain valid over time (e.g. account for disk failures corrupting the backup)
- Should document updated manual steps to restore from local backup. These steps should be part of the e2e test coverage.
- Should document manual steps to copy backup files to a destination outside the cluster (e.g. an ssh copy that a cluster admin can use in a CronJob).
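
As one possible shape for the backup validation item above, the following is a minimal sketch and not the planned tool. It assumes a checksum manifest is written next to the backup files when the backup is taken, and that a later run recomputes the checksums to detect corruption such as disk failures. The manifest name (manifest.json) and the function names are hypothetical; a checksum match only shows the files are unchanged since the manifest was written, not that the snapshot itself is restorable.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(backup_dir: Path, manifest_name: str = "manifest.json") -> Path:
    """Record a checksum for every file in backup_dir (hypothetical manifest format)."""
    checksums = {
        p.name: sha256_of(p)
        for p in sorted(backup_dir.iterdir())
        if p.is_file() and p.name != manifest_name
    }
    manifest_path = backup_dir / manifest_name
    manifest_path.write_text(json.dumps(checksums, indent=2))
    return manifest_path


def verify_manifest(backup_dir: Path, manifest_name: str = "manifest.json") -> list[str]:
    """Return the names of files that are missing or no longer match the manifest."""
    checksums = json.loads((backup_dir / manifest_name).read_text())
    failed = []
    for name, expected in checksums.items():
        candidate = backup_dir / name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            failed.append(name)
    return failed
```

In this sketch, write_manifest would run once right after a backup completes, and verify_manifest could run periodically (for example from a CronJob); an empty result means every recorded file still matches its original checksum.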