[ETCD-328] Support deletion and automatic replacement of an unhealthy member machine in N member cluster - Red Hat Issue Tracker

Type: Story
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels: None

Blocked: False
Blocked Reason: None
Ready: False
Epic Link: ETCD-333
Feature Link: OCPPLAN-9749 - Control Plane Scaling and Recovery (IPI clusters only) - Phase 0

Sprint: ETCD Sprint 225, ETCD Sprint 226

Release Blocker: Rejected

Overview:

Given a 3-member cluster with one unhealthy member, the expected vertical scaling workflow with ControlPlaneMachineSets (CPMS) is to delete the machine for that unhealthy member so that a new one is created to replace it, restoring the cluster to 3 healthy members.

This story tracks the design and work required on the etcd-operator's side to enable automated recovery in this scenario.

Background:

Per the upstream recommendations and the design of the etcd quorum protection proposal, we cannot add a new member while the etcd cluster has unhealthy members.
https://etcd.io/docs/v3.5/faq/#should-i-add-a-member-before-removing-an-unhealthy-member
https://github.com/openshift/enhancements/pull/943#discussion_r742209444
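
The arithmetic behind this recommendation can be sketched as follows. This is a minimal illustration of majority quorum, not code from etcd or the etcd-operator; the function names `quorum` and `failures_tolerated` are hypothetical.

```python
def quorum(voting_members: int) -> int:
    """Votes needed for a majority among `voting_members` voting members."""
    return voting_members // 2 + 1


def failures_tolerated(voting_members: int, healthy: int) -> int:
    """Further failures the cluster can absorb; negative means quorum is already lost."""
    return healthy - quorum(voting_members)


# Starting point: 3 voting members, 1 unhealthy, so 2 healthy.
start = failures_tolerated(3, 2)

# Add first: the cluster grows to 4 voting members and quorum rises to 3.
# If the new member fails to start (for example, it is misconfigured),
# only 2 of 4 members are healthy and quorum is lost.
add_first_worst_case = failures_tolerated(4, 2)

# Remove first: the cluster shrinks to 2 voting members, both healthy.
remove_first = failures_tolerated(2, 2)

print(start, add_first_worst_case, remove_first)
```

Under these assumptions, removing the unhealthy member first never drops below quorum, whereas adding first can lose quorum if the new member does not come up.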

One option to consider, then, is to allow the scale-down of the unhealthy member (as prompted by its machine deletion) so that the cluster scales down to 2 healthy members, and subsequently to scale up a new member on the replacement machine created by CPMS.

This needs to be carefully considered, as a voting membership change of 3->2->3 leaves the cluster one member away from quorum loss. More importantly, it remains to be seen how this would work in conjunction with the quorum check that prevents revision rollouts when the etcd cluster is degraded with an unhealthy member.
https://github.com/openshift/cluster-etcd-operator/pull/872
https://github.com/openshift/cluster-etcd-operator/blob/ac362e9bf9931be0234f6c92518128536a8622cc/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go#L145-L151
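
The 3->2->3 sequence can be walked through with a small model. This is a sketch under stated assumptions: it counts voting members only, and `ClusterState` and `can_remove_unhealthy` are hypothetical names for illustration, not the operator's actual quorum check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterState:
    label: str
    voting: int   # voting members registered in the cluster
    healthy: int  # voting members that are currently healthy

    @property
    def quorum(self) -> int:
        return self.voting // 2 + 1

    @property
    def failures_tolerated(self) -> int:
        return self.healthy - self.quorum


def can_remove_unhealthy(state: ClusterState) -> bool:
    """Hypothetical guard: allow removing one unhealthy member only if the
    remaining healthy members still form a quorum of the smaller cluster."""
    if state.healthy >= state.voting:
        return False  # no unhealthy member to remove
    return state.healthy >= (state.voting - 1) // 2 + 1


steps = [
    ClusterState("degraded: 1 of 3 unhealthy", voting=3, healthy=2),
    ClusterState("unhealthy member removed", voting=2, healthy=2),
    ClusterState("replacement added, not yet healthy", voting=3, healthy=2),
    ClusterState("replacement healthy", voting=3, healthy=3),
]

for s in steps:
    print(f"{s.label}: quorum={s.quorum}, failures tolerated={s.failures_tolerated}")
```

In this model the cluster tolerates zero further failures at every intermediate step, which is the risk described above; tolerance returns to one only once the replacement member is healthy.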

Expected outcome:

Investigate whether it is possible to relax the quorum protection and the revision rollout requirements to enable automated recovery of the unhealthy member in this scenario.
If so, the implementation should have a corresponding e2e test for this scenario in the vertical scaling test suite in openshift/origin.
The etcd quorum protection proposal should then also be updated to document the agreed-upon changes.
https://github.com/openshift/enhancements/blob/master/enhancements/etcd/protecting-etcd-quorum-during-control-plane-scaling.md

causes

ETCD-336 E2E deletion and automatic replacement of an unhealthy member machine in N member cluster (In Progress)

OCPBUGS-2979 [4.12] automatic replacement of an unhealthy member machine (Closed)

is cloned by

ETCD-336 E2E deletion and automatic replacement of an unhealthy member machine in N member cluster (In Progress)

links to

openshift/cluster-etcd-operator#937: OCPBUGS-1000: Allow scale-down of unhealthy member

openshift/cluster-etcd-operator#947: ETCD-328: automatic replacement of an unhealthy member machine

openshift/cluster-etcd-operator#960: [release-4.12] ETCD-328: automatic replacement of an unhealthy member machine

Assignee: Mustafa Elbehery
Reporter: Haseeb Tariq
Votes: 0
Watchers: 4
Created: 2022/09/29 11:37 PM
Updated: 2022/10/31 12:17 PM
Resolved: 2022/10/31 12:00 PM
