Feature
Resolution: Done
Background
Etcd cluster failures generally fall into two categories: minority failure and majority failure. A majority failure, where etcd quorum is lost, results in an API server outage and can be resolved by restoring the etcd cluster from a backed-up snapshot. A minority failure, where quorum is maintained, should not require disrupting API server requests to resolve, because the etcd cluster can still process reads and writes.
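The distinction between the two categories comes down to quorum arithmetic: a cluster of n members needs floor(n/2) + 1 healthy members to keep quorum. A minimal sketch of that classification follows; the function names are illustrative only and are not part of etcd or of this feature.

```python
def quorum_size(members: int) -> int:
    """Smallest number of healthy members that still forms a quorum."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def classify_failure(members: int, failed: int) -> str:
    """Classify a failure as 'healthy', 'minority', or 'majority'.

    'minority' means quorum is maintained (reads and writes continue);
    'majority' means quorum is lost.
    """
    if not 0 <= failed <= members:
        raise ValueError("failed must be between 0 and members")
    if failed == 0:
        return "healthy"
    healthy = members - failed
    return "minority" if healthy >= quorum_size(members) else "majority"
```

For a three-member cluster, losing one member is a minority failure, while losing two is a majority failure.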
Feature Overview
Provide non-disruptive recovery steps for etcd minority failure scenarios, enhancing the stability of our platform and preventing data loss and service disruptions.
Goals
- Ensure that we have viable and safe manual recovery methods for etcd minority failure that do not require API server disruption (v4.14).
- Automate the etcd minority failure recovery methods (v4.15 onwards).
Requirements
- The manual recovery must be possible when etcd is backed by local storage.
- The manual recovery must be non-disruptive, with API read/write operations continuing to work as long as the existing etcd cluster maintains quorum.
Use Cases
- A single etcd instance, backed by local storage, loses access to its node and there's no available snapshot for recovery. In this case, recovery should be possible using the remaining etcd instances which still have quorum.
- A single etcd instance's PVC data might be lost, but a snapshot of that data exists. Recovery of the PVC from a snapshot and rescheduling of the etcd instance should be possible.
- A single etcd instance backed by local storage needs to be moved to another management node. This process should be possible by backing up the existing PVC to a snapshot on distributed storage and then restoring that data to local storage on another management node.
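For the first use case, upstream etcd's general approach to replacing a failed member is to remove it from the membership and then add a replacement, which syncs its data from the members that still hold quorum. The sketch below only builds the corresponding `etcdctl` command lines and does not run them. It is an assumption about what a manual procedure might involve, not the procedure delivered by this feature; the function name, endpoints, member ID, and peer URL are all hypothetical, and a real cluster would also need TLS options.

```python
def replacement_plan(endpoints, failed_member_id, new_name, new_peer_url):
    """Return etcdctl argument lists for replacing one failed member.

    All argument values are placeholders supplied by the caller.
    TLS flags are omitted here for brevity.
    """
    base = ["etcdctl", "--endpoints=" + ",".join(endpoints)]
    return [
        # Inspect the current membership to find the failed member's ID.
        base + ["member", "list"],
        # Remove the failed member while the rest keep quorum.
        base + ["member", "remove", failed_member_id],
        # Register the replacement; it joins the existing cluster.
        base + ["member", "add", new_name, "--peer-urls=" + new_peer_url],
    ]
```

Keeping the plan as data rather than executing it makes it easy to review each step before anything is run against a live cluster.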
Out of Scope
- Recovery steps for etcd majority failure scenarios.
- Recovery process for etcd instances not backed by local storage.
Customer Considerations
Given the central role that etcd plays in the operation of the clusters, disruptions can have significant impacts on customers. Ensuring a smooth recovery process will help minimize downtime and data loss.
Documentation Considerations
Documentation should be created detailing the (manual) process of non-disruptive recovery for etcd minority failure scenarios. It should cover the different use cases, potential challenges, and recovery steps.
- relates to: HOSTEDCP-1070 Control Plane Pod supported persistent storage backends (Closed)