[OCPSTRAT-403] [Tech Preview] Automated backups of etcd (local destination) - Red Hat Issue Tracker

Type: Feature
Resolution: Done
Priority: Critical
Fix Version/s: openshift-4.14
Affects Version/s: None
Component/s: API & Datastore
Labels:
- etcd
- etcd-backups

Work Type:
BU Product Work
Blocked:
False
Blocked Reason:
None
Ready:
False
Target Version:

openshift-4.14

Risk Score:
0


BU Priority Overview

Initiative: Improve etcd disaster recovery experience (part1)

Goals

The current etcd backup and recovery process is described in our docs: https://docs.openshift.com/container-platform/4.12/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html

The current process leaves it up to the cluster-admin to work out how to take consistent backups by following the documented procedure.

This feature is part of a progressive delivery to improve the cluster-admin experience of backing up etcd clusters and restoring them to a healthy state.

Scope of this feature:

etcd quorum loss (2-node failure) on a 3-node OCP control plane
etcd degradation (1-node failure) on a 3-node OCP control plane
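
The two scenarios above follow from majority quorum: etcd is built on Raft consensus, so a cluster of n members needs floor(n/2) + 1 healthy members to keep accepting writes. A minimal sketch of that arithmetic, with illustrative function names that are not part of any OpenShift or etcd API:

```python
def quorum(members: int) -> int:
    """Smallest number of healthy members that still forms a majority."""
    return members // 2 + 1


def classify(members: int, failed: int) -> str:
    """Label the cluster state for a given number of failed members."""
    healthy = members - failed
    if failed == 0:
        return "healthy"
    if healthy >= quorum(members):
        # A majority is still available, so the cluster keeps serving.
        return "degraded"
    # Fewer than a majority remain, so writes can no longer be committed.
    return "quorum lost"


if __name__ == "__main__":
    # Walk through 0, 1 and 2 failures on a 3-member control plane.
    for failed in range(3):
        print(failed, classify(3, failed))
```

On a 3-node control plane the quorum is 2, so one failed node leaves the cluster degraded but available, while two failed nodes cause quorum loss and call for a restore from backup.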

Execution Plans

Improve etcd disaster recovery e2e test coverage
Design the automated backup API. The initial target is a local destination.
Should provide a way (e.g. script or tool) for cluster-admin to validate that backup files remain valid over time (e.g. account for disk failures corrupting the backup)
Should document updated manual steps to restore from local backup. These steps should be part of the e2e test coverage.
Should document manual steps to copy backup files to a destination outside the cluster (e.g. an ssh copy that a cluster-admin can run in a CronJob)
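
One possible shape for the validation item above is a checksum manifest: record a SHA-256 digest for each backup file when it is written, then recompute the digests later to detect corruption. The sketch below is an assumption for illustration only, not the tool this feature delivers; the function names and the JSON manifest layout are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large snapshots are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(backup_dir: Path, manifest: Path) -> dict[str, str]:
    """Record a digest for every regular file in backup_dir (hypothetical layout)."""
    sums = {
        entry.name: sha256_of(entry)
        for entry in sorted(backup_dir.iterdir())
        if entry.is_file() and entry.resolve() != manifest.resolve()
    }
    manifest.write_text(json.dumps(sums, indent=2))
    return sums


def verify_manifest(backup_dir: Path, manifest: Path) -> list[str]:
    """Return the names of files that are missing or whose digest changed."""
    expected = json.loads(manifest.read_text())
    damaged = []
    for name, digest in expected.items():
        path = backup_dir / name
        if not path.is_file() or sha256_of(path) != digest:
            damaged.append(name)
    return damaged
```

A cluster-admin could call `write_manifest` right after each backup and run `verify_manifest` on a schedule; an empty result means every recorded file still matches its digest. A matching digest only shows that the file is unchanged since it was recorded, not that the snapshot can be restored, which is why the restore steps also belong in the e2e coverage.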

blocks

OCPSTRAT-464 Automated backups of etcd (external targets)

Backlog

is related to

OCPSTRAT-529 Improve disaster recovery test coverage for etcd

Closed

relates to

API-1376 OpenShift 4.X supports an official process to shut down, restart, and resume an OpenShift cluster from a powered off state, this function should be continuously validated, supported, and guaranteed for consumers for DR and lifecycle use-cases

ACM-1699 ACM Better Integration of ETCD-backup-Policy

Closed

links to

openshift/enhancements#1370: WIP: ETCD-295: Automated Backups of Etcd

openshift/enhancements#1486: Automated Backups of Etcd


Assignee: William Caban

Reporter: William Caban

QA Contact: Wei Sun

Doc Contact: Matthew Werner

Architect: David Eads

Product Operations Engineering Contact: Eric Rich

Votes: 1

Watchers: 15

Created: 2023/02/15 4:01 PM

Updated: 2025/02/19 5:46 PM

Resolved: 2023/09/14 3:15 PM
