  1. OpenShift Etcd
  2. ETCD-328

Support deletion and automatic replacement of an unhealthy member machine in an N-member cluster


    • Type: Story
    • Resolution: Done
    • Priority: Major
    • OCPPLAN-9749 - Control Plane Scaling and Recovery (IPI clusters only) - Phase 0
    • ETCD Sprint 225, ETCD Sprint 226
    • Rejected

      Overview:

      Given a 3-member cluster with one unhealthy member, the expected vertical scaling workflow (with ControlPlaneMachineSets (CPMS)) is to delete the machine backing that unhealthy member so that a new machine can be created to replace it, restoring the cluster to 3 healthy members.

      This story tracks the design and work required on the etcd-operator's side to enable automated recovery in this scenario.

      Background:

      Per the upstream recommendations and the design of the etcd quorum protection proposal, we cannot add a new member while the etcd cluster has unhealthy members.
      https://etcd.io/docs/v3.5/faq/#should-i-add-a-member-before-removing-an-unhealthy-member
      https://github.com/openshift/enhancements/pull/943#discussion_r742209444
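
      The reasoning behind the upstream recommendation can be checked with quorum arithmetic. The following is a minimal sketch, not operator code: the helper names `quorum` and `faultTolerance` are made up for illustration, and the only fact assumed is that etcd needs floor(n/2)+1 voting members for quorum.

```go
package main

import "fmt"

// quorum returns how many voting members must be available: floor(n/2) + 1.
func quorum(members int) int {
	return members/2 + 1
}

// faultTolerance returns how many additional healthy members can fail
// before quorum is lost. A negative value means quorum is already lost.
// (Hypothetical helper for illustration only.)
func faultTolerance(members, healthy int) int {
	return healthy - quorum(members)
}

func main() {
	// Starting point: 3 voting members, 2 of them healthy.
	fmt.Println("start (3 members, 2 healthy):", faultTolerance(3, 2))

	// Ordering A: add the replacement first, giving 4 voting members.
	// If the new member fails to start, only 2 of 4 are healthy.
	fmt.Println("add first, new member broken:", faultTolerance(4, 2))
	fmt.Println("add first, new member healthy:", faultTolerance(4, 3))

	// Ordering B: remove the unhealthy member first, then add.
	fmt.Println("remove first (2 members, 2 healthy):", faultTolerance(2, 2))
	// Even if the new member fails to start, 2 of 3 are still healthy.
	fmt.Println("then add, new member broken:", faultTolerance(3, 2))
	fmt.Println("then add, new member healthy:", faultTolerance(3, 3))
}
```

      In this sketch, adding first with a broken new member is the only path that yields a negative margin (quorum lost), which is why the unhealthy member is removed before a replacement is added.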

      One option to consider, then, is to allow the scale-down of the unhealthy member (as prompted by its machine deletion), leaving 2 healthy members, and subsequently to scale up a new member on the replacement machine that CPMS creates.

      This needs careful consideration: a voting membership change of 3->2->3 leaves the cluster one member failure away from quorum loss while it has only 2 voting members. More importantly, we need to determine how this would interact with the quorum check that prevents revision rollouts while the etcd cluster is degraded by an unhealthy member.
      https://github.com/openshift/cluster-etcd-operator/pull/872
      https://github.com/openshift/cluster-etcd-operator/blob/ac362e9bf9931be0234f6c92518128536a8622cc/pkg/operator/etcdendpointscontroller/etcdendpointscontroller.go#L145-L151
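
      One possible shape for a relaxed check is sketched below. This is purely hypothetical and does not reflect the operator's existing logic: the `Member` type, its fields, and `canRemove` are all invented for illustration. The assumption is that removal is allowed only for an unhealthy member whose machine is being deleted, and only if the remaining healthy members still satisfy quorum for the reduced cluster.

```go
package main

import "fmt"

// Member is a hypothetical, simplified view of an etcd voting member.
type Member struct {
	Name            string
	Healthy         bool
	MachineDeleting bool // hypothetical flag: backing machine is marked for deletion
}

// quorum returns how many voting members must be available: floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

// canRemove is a hypothetical relaxed check. It reports whether the named
// member is unhealthy, has its machine pending deletion, and can be removed
// while the remaining healthy members still form a quorum of the smaller cluster.
func canRemove(members []Member, name string) bool {
	healthyRemaining, found := 0, false
	for _, m := range members {
		if m.Name == name {
			found = true
			if m.Healthy || !m.MachineDeleting {
				return false
			}
			continue
		}
		if m.Healthy {
			healthyRemaining++
		}
	}
	return found && healthyRemaining >= quorum(len(members)-1)
}

func main() {
	cluster := []Member{
		{Name: "etcd-0", Healthy: true},
		{Name: "etcd-1", Healthy: true},
		{Name: "etcd-2", Healthy: false, MachineDeleting: true},
	}
	// 2 healthy members remain and a 2-member cluster needs 2 for quorum.
	fmt.Println(canRemove(cluster, "etcd-2"))

	// With a second unhealthy member, removal would leave 1 healthy of 2.
	cluster[1].Healthy = false
	fmt.Println(canRemove(cluster, "etcd-2"))
}
```

      The check deliberately refuses to remove healthy members or members whose machine is not being deleted, so it only covers the 3->2 step of the scenario described here.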

      Expected outcome:

      Investigate whether the quorum protection and the revision rollout requirements can be relaxed to enable automated recovery of the unhealthy member in this scenario.
      If so, the implementation should include a corresponding e2e test for this scenario in the vertical scaling test suite in openshift/origin.
      The etcd quorum protection proposal should then also be updated to document the agreed-upon changes.
      https://github.com/openshift/enhancements/blob/master/enhancements/etcd/protecting-etcd-quorum-during-control-plane-scaling.md

            melbeher@redhat.com Mustafa Elbehery
            rhn-coreos-htariq Haseeb Tariq