Type: Feature
Resolution: Done
Fix Version/s: None
Affects Version/s: None
Component/s: Hosted Control Planes
Labels:
- ga_readiness
- self-managed

Activity Type: Product / Portfolio Work
Parent Link: None
Blocked: False
Blocked Reason: None
Ready: False
Size: None
Target Version: openshift-4.14
Release Blocker: None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Priority Data:
PX Impact Score: None
PX Technical Impact: None
PX Impact Range: None
PX Scheduling Request: None
PX Technical Impact Notes: None

Intelligence Requested:
Market:

Background

Etcd cluster failures generally fall into two categories: minority failure and majority failure. A majority failure, where etcd quorum is lost, results in an API server outage and can be resolved by restoring the etcd cluster from a backed-up snapshot. A minority failure, where quorum is maintained, should be resolvable without disrupting API server requests, because the etcd cluster can still process reads and writes.
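
To make the minority/majority distinction concrete, here is a minimal Python sketch of the quorum arithmetic. Quorum is a strict majority of the configured members. The function names are illustrative only and are not part of any etcd or HyperShift API.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    if members < 1:
        raise ValueError("an etcd cluster needs at least one member")
    return members // 2 + 1


def classify_failure(members: int, failed: int) -> str:
    """Return 'healthy', 'minority' or 'majority' for a given failure count."""
    if not 0 <= failed <= members:
        raise ValueError("failed must be between 0 and members")
    if failed == 0:
        return "healthy"
    healthy = members - failed
    # Quorum holds while the healthy members still form a majority.
    return "minority" if healthy >= quorum_size(members) else "majority"


if __name__ == "__main__":
    # A three-member cluster tolerates exactly one failed member.
    for failed in range(4):
        print(failed, classify_failure(3, failed))
```

For a three-member cluster, one failed member is a minority failure, while two or more failed members is a majority failure.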

Feature Overview

Provide non-disruptive recovery steps for etcd minority failure scenarios, enhancing the stability of our platform and preventing data loss and service disruptions.

Goals

Ensure that we have viable and safe manual recovery methods for etcd minority failure that do not require API server disruption (v4.14).
Automate the etcd minority failure recovery methods (v4.15 onwards).

Requirements

The manual recovery must be possible when etcd is backed by local storage.
The manual recovery must be non-disruptive, with API read/write operations continuing to work as long as the existing etcd cluster maintains quorum.

Use Cases

A single etcd instance, backed by local storage, loses access to its node and there's no available snapshot for recovery. In this case, recovery should be possible using the remaining etcd instances which still have quorum.
A single etcd instance's PVC data might be lost, but a snapshot of that data exists. Recovery of the PVC from a snapshot and rescheduling of the etcd instance should be possible.
A single etcd instance backed by local storage needs to be moved to another management node. This process should be possible by backing up the existing PVC to a snapshot on distributed storage and then restoring that data to local storage on another management node.
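
As a rough illustration of the first use case, the sketch below builds the ordered list of `etcdctl` commands for replacing one failed member while the remaining members keep quorum. It follows the generic upstream etcd member-replacement flow (remove the failed member, then add a replacement) and is an assumption, not the procedure this feature will document. The `Member` class and `replacement_plan` function are hypothetical. Connection flags such as `--endpoints` and the TLS options are omitted, and the ticket does not say where these commands would run in a hosted control plane.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    member_id: str  # hex ID as printed by `etcdctl member list`
    name: str
    peer_url: str
    healthy: bool


def replacement_plan(members: list[Member], new_peer_url: str) -> list[str]:
    """Return etcdctl commands that replace the single failed member."""
    failed = [m for m in members if not m.healthy]
    if len(failed) != 1:
        raise ValueError("this sketch handles exactly one failed member")
    healthy = len(members) - len(failed)
    if healthy < len(members) // 2 + 1:
        # Without quorum, membership changes cannot be committed.
        raise ValueError("quorum is lost; this is a majority failure")
    old = failed[0]
    return [
        "etcdctl endpoint health --cluster",       # confirm which member is down
        f"etcdctl member remove {old.member_id}",  # drop the failed member first
        f"etcdctl member add {old.name} --peer-urls={new_peer_url}",
        "etcdctl member list -w table",            # verify the new membership
    ]


if __name__ == "__main__":
    # Example values below are made up for illustration.
    cluster = [
        Member("a1", "etcd-0", "https://etcd-0.example:2380", True),
        Member("b2", "etcd-1", "https://etcd-1.example:2380", False),
        Member("c3", "etcd-2", "https://etcd-2.example:2380", True),
    ]
    for command in replacement_plan(cluster, "https://etcd-1.example:2380"):
        print(command)
```

Removing the failed member before adding its replacement keeps the configured cluster size, and therefore the quorum requirement, from growing while one member is still unavailable.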

Out of Scope

Recovery steps for etcd majority failure scenarios.
Recovery process for etcd instances not backed by local storage.

Customer Considerations

Given the central role that etcd plays in the operation of the clusters, disruptions can have significant impacts on customers. Ensuring a smooth recovery process will help minimize downtime and data loss.

Documentation Considerations

Documentation should be created detailing the manual process of non-disruptive recovery for etcd minority failure scenarios. It should cover the different use cases, potential challenges, and recovery steps.

relates to

HOSTEDCP-1070 Control Plane Pod supported persistent storage backends