OpenShift Bugs / OCPBUGS-47509

etcdMembersDown should not fire on healthy etcd scaling event


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Normal
    • Affects Version/s: 4.14, 4.19
    • Component/s: Etcd
    • Quality / Stability / Reliability
    • Severity: Moderate

      Description of problem:

      In a live 4.14.35 production cluster working through a control-plane replacement, administrators deleted control-plane Machines and the ControlPlaneMachineSet controller worked with the etcd operator to bring in replacements. During that rollout, etcdMembersDown fired for one of the outgoing instances. That member's exit was expected, but the critical alert firing during a healthy removal caused concern and prompted manual checks to confirm that the removal was in fact proceeding without issue (a sketch of those checks follows). We should adjust the alert so it does not fire during these smooth control-plane replacements, to avoid distracting future administrators.
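
      Those manual checks amounted to confirming that etcd membership and the ControlPlaneMachineSet rollout were both healthy while the alert was firing. A minimal sketch of that kind of check, where the etcd pod name is a placeholder and the exact commands are from memory rather than from the affected cluster:

      # Remaining etcd members and their health, from a surviving etcd pod's etcdctl container.
      $ oc -n openshift-etcd rsh -c etcdctl pod/etcd-<surviving-master> etcdctl member list -w table
      $ oc -n openshift-etcd rsh -c etcdctl pod/etcd-<surviving-master> etcdctl endpoint health
      # ControlPlaneMachineSet and Machine state, to confirm the replacement is rolling in.
      $ oc -n openshift-machine-api get controlplanemachineset cluster
      $ oc -n openshift-machine-api get machines -l machine.openshift.io/cluster-api-machine-role=master
      # The etcd ClusterOperator should be Progressing, not Degraded, during a healthy replacement.
      $ oc get clusteroperator etcd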

      Version-Release number of selected component:

      Seen in a 4.14.35 cluster, but also turns up in dev/4.19 CI:

      $ w3m -dump -cols 200 'https://search.dptools.openshift.org/?search=etcdMembersDown.*firing+for+%5B%5E0%5D&maxAge=24h&type=junit' | grep 'failures match'
      pull-ci-openshift-cluster-control-plane-machine-set-operator-main-e2e-gcp-ovn-etcd-scaling (all) - 3 runs, 67% failed, 100% of failures match = 67% impact
      pull-ci-openshift-cluster-control-plane-machine-set-operator-main-e2e-vsphere-ovn-etcd-scaling (all) - 3 runs, 67% failed, 50% of failures match = 33% impact
      periodic-ci-openshift-release-master-nightly-4.17-e2e-aws-ovn-etcd-scaling (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
      

      How reproducible:

      According to my CI Search results above, roughly half of all etcd-scaling CI runs reproduce this alert issue.

      Steps to Reproduce:

      1. Run some etcd-scaling CI.
      2. Check the "alert/etcdMembersDown should not be at or above info" test-case (a rough manual equivalent is sketched after these steps).
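
      Outside CI, the same churn can be provoked by hand on a disposable cluster. The Machine name below is a placeholder, and the Thanos-querier query is just one convenient way to watch the alert (it assumes you are logged in with a token that is allowed to query cluster monitoring, e.g. cluster-admin):

      # Deleting a control-plane Machine makes the ControlPlaneMachineSet controller
      # create a replacement, and the etcd operator removes the outgoing member.
      $ oc -n openshift-machine-api delete machine <some-master-machine>
      # Watch whether etcdMembersDown goes pending/firing while the replacement rolls in.
      $ token="$(oc whoami -t)"
      $ host="$(oc -n openshift-monitoring get route thanos-querier -o jsonpath='{.spec.host}')"
      $ curl -sk -G -H "Authorization: Bearer ${token}" \
          --data-urlencode 'query=ALERTS{alertname="etcdMembersDown"}' \
          "https://${host}/api/v1/query"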

      Actual results:

      Sometimes the test-case passes, but about half the time it fails because the alert fires. For example, in this run:

      : [bz-etcd][invariant] alert/etcdMembersDown should not be at or above info expand_less	0s
      {  etcdMembersDown was at or above info for at least 1m58s on platformidentification.JobType{Release:"4.19", FromRelease:"", Platform:"gcp", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 14m34s, firing for 1m58s:
      
      Dec 20 17:05:27.811 - 118s  E namespace/openshift-etcd pod/etcd-ci-op-4cyd0vfv-c7885-s78z5-master-1 alert/etcdMembersDown alertstate/firing severity/critical ALERTS{alertname="etcdMembersDown", alertstate="firing", job="etcd", namespace="openshift-etcd", pod="etcd-ci-op-4cyd0vfv-c7885-s78z5-master-1", prometheus="openshift-monitoring/k8s", service="etcd", severity="critical"}}
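
      The rule behind that firing ships as a PrometheusRule object, so its expression and 'for' duration can be pulled straight from a live cluster. The grep below is deliberately broad because I have not pinned down the exact PrometheusRule name and namespace across releases:

      # Dump the etcdMembersDown rule definition (expression, 'for', severity) wherever it lives.
      $ oc get prometheusrules -A -o yaml | grep -B 2 -A 15 'alert: etcdMembersDown'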
      

      Expected results:

      The test-case passes reliably, because the alert does not fire during a healthy member replacement.

        Attachments:
        1. screenshot-1.png (232 kB, W. Trevor King)
        2. screenshot-2.png (124 kB, W. Trevor King)
        3. screenshot-3.png (179 kB, W. Trevor King)

              Assignee: Dean West (dwest@redhat.com)
              Reporter: W. Trevor King (trking)
              QA Contact: Ge Liu