OpenShift Bugs / OCPBUGS-66081

Etcd quorum lost during control plane node replacement

    • x86_64

      Description of problem
      A compliance controller ordered the deletion of all three of an OCP cluster's control plane machines within a span of roughly one hour. The deletions themselves are expected behavior, because the machines were older than compliance rules allow. Etcd quorum was lost during this span because the "old" machines were drained and deleted before their replacement machines had fully joined the etcd cluster.

      Version-Release number of selected component (if applicable):
      OCP v4.19.17

      How reproducible
      Unclear

      Steps to Reproduce

      1. Run oc delete machine -n openshift-machine-api master-0
      2. Wait 15 minutes
      3. Run oc delete machine -n openshift-machine-api master-1
      4. Wait 15 minutes
      5. Run oc delete machine -n openshift-machine-api master-2
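The steps above can be sketched as a small shell helper. This is only an illustration: the machine names come from the steps as written and may not match a real cluster, and DRY_RUN, WAIT_SECONDS, delete_machine, and reproduce are hypothetical names introduced here. By default the sketch only prints the commands it would run.

```shell
# Sketch of the reproduction steps; nothing runs until reproduce is called.
NAMESPACE="openshift-machine-api"
WAIT_SECONDS="${WAIT_SECONDS:-900}"   # 15 minutes between deletions (hypothetical knob)
DRY_RUN="${DRY_RUN:-1}"               # 1 = print the commands only (hypothetical knob)

# Delete one control plane machine, or print the command in dry-run mode.
delete_machine() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "oc delete machine -n $NAMESPACE $1"
  else
    oc delete machine -n "$NAMESPACE" "$1"
  fi
}

# Delete each named machine in order, waiting between deletions
# but not after the last one.
reproduce() {
  remaining=$#
  for machine in "$@"; do
    delete_machine "$machine"
    remaining=$((remaining - 1))
    if [ "$remaining" -gt 0 ]; then
      sleep "$WAIT_SECONDS"
    fi
  done
}
```

Calling reproduce master-0 master-1 master-2 with the defaults prints the three oc commands with a 15-minute pause between them. Setting DRY_RUN=0 would issue the real deletions, which should only be done on a disposable cluster.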

      Actual results
      Control plane machines are drained and shut down too soon, leaving the cluster with only 1-2 healthy etcd members until the replacement node has fully joined the etcd cluster. The etcdNoLeader and etcdInsufficientMembers alerts fire intermittently.
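One way to watch for the condition described above is to count healthy endpoints while the deletions are in progress. The helper below is a minimal sketch and not part of the original report. It assumes the health output is piped in on stdin and that each healthy endpoint line contains the phrase "is healthy", which may vary between etcd versions. The function name count_healthy_endpoints and the pod name in the usage comment are hypothetical.

```shell
# Count lines on stdin that report a healthy etcd endpoint.
# Assumption: healthy endpoints are reported with the phrase "is healthy".
count_healthy_endpoints() {
  # grep -c exits non-zero when the count is 0, so ignore its exit status.
  grep -c 'is healthy' || true
}

# Possible usage (pod name is a placeholder; 2>&1 because some etcdctl
# versions write health results to stderr):
#   oc rsh -n openshift-etcd <etcd-pod> etcdctl endpoint health --cluster 2>&1 \
#     | count_healthy_endpoints
```

A count below 2 on a 3-node control plane would correspond to the loss of quorum reported here.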

      Expected results
      A deleted control plane machine isn't drained until its replacement machine has fully provisioned, joined the cluster as a node, and joined the etcd cluster as a member. Ideally, this means that there would briefly be 4 etcd members (assuming the usual 3-node control plane), but even in the worst case, the Control Plane Machine Set (CPMS) should ensure there are never fewer than 2 healthy etcd members, which is the quorum for a 3-member cluster.
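The "never fewer than 2" threshold follows from etcd's majority rule: a cluster of N members needs floor(N/2) + 1 healthy members to keep quorum. The sketch below only illustrates that arithmetic. The function names quorum_size and has_quorum are hypothetical and do not come from CPMS or etcd.

```shell
# quorum_size N: healthy members needed for a majority of N members.
quorum_size() {
  echo $(( $1 / 2 + 1 ))
}

# has_quorum HEALTHY TOTAL: succeeds if HEALTHY members form a majority of TOTAL.
has_quorum() {
  [ "$1" -ge "$(quorum_size "$2")" ]
}
```

For the usual 3-member control plane, quorum_size 3 gives 2, so has_quorum 2 3 succeeds while has_quorum 1 3 fails. During the brief 4-member window, quorum_size 4 gives 3, so the cluster still tolerates the loss of one member.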

      Additional info
      This graph screenshot shows the observed behavior. The three "rapid" machine deletions took place between ~8pm and ~9pm. The unacceptable period, when there was only one healthy etcd member, occurs around 8:12pm. The brief spike to 4 members around 10pm was the result of a test deletion we performed after the bug was observed, and it demonstrates the expected behavior.

              ddonati@redhat.com Damiano Donati
              abyrne.openshift Anthony Byrne
              Zhaohua Sun Zhaohua Sun