OCPBUGS-36489

[4.16] Should run health checks in parallel to avoid spurious Available=False EtcdMembers_NoQuorum claims


    • Bug
    • Resolution: Done
    • Major
    • 4.16.z
    • 4.12, 4.13, 4.14, 4.15, 4.16, 4.17
    • Etcd
    • None
    • Moderate
    • No
    • 1
    • ETCD Sprint 256, ETCD Sprint 257
    • 2
    • Proposed
    • False
    • Previous versions of the etcd operator checked the health of etcd members in serial, with an all-member timeout that matched the single-member timeout. That allowed one slow member check to consume the entire timeout and cause later member checks to fail on deadline-exceeded, regardless of the health of those later members. Now etcd checks the health of members in parallel, so the health and speed of one member's check doesn't affect the other members' checks.
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-36301. The following is the description of the original issue:

      Description of problem

      Seen in a 4.16.1 CI run:

      : [bz-Etcd] clusteroperator/etcd should not change condition/Available	1h28m39s
      {  2 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:
      
      Jun 27 14:17:18.966 E clusteroperator/etcd condition/Available reason/EtcdMembers_NoQuorum status/False EtcdMembersAvailable: 1 of 3 members are available, ip-10-0-71-113.us-west-1.compute.internal is unhealthy, ip-10-0-58-93.us-west-1.compute.internal is unhealthy
      Jun 27 14:17:18.966 - 75s   E clusteroperator/etcd condition/Available reason/EtcdMembers_NoQuorum status/False EtcdMembersAvailable: 1 of 3 members are available, ip-10-0-71-113.us-west-1.compute.internal is unhealthy, ip-10-0-58-93.us-west-1.compute.internal is unhealthy
      

      But further digging turned up no sign that quorum had had any difficulties. The problem seems to be the structure of GetMemberHealth, which currently allows timelines like:

      • T0, start probing all known members in GetMemberHealth
      • Tsmall, MemberA Healthy:true Took:41.614949ms Error:<nil>
      • Talmost-30s, MemberB Healthy:false Took:29.869420582s Error:health check failed: context deadline exceeded
      • T30s, DefaultClientTimeout runs out.
      • T30s, MemberC Healthy:false Took:27.199µs Error:health check failed: context deadline exceeded
      • TB, next probe round rolls around, start probing all known members in GetMemberHealth.
      • TBsmall, MemberA Healthy:true Took:...ms Error:<nil>
      • TB+30s, MemberB Healthy:false Took:29....s Error:health check failed: context deadline exceeded
      • TB+30s, DefaultClientTimeout runs out.
      • TB+30s, MemberC Healthy:false Took:...µs Error:health check failed: context deadline exceeded

      That can leave 30+s gaps of nominal Healthy:false for MemberC when in fact MemberC was completely fine.
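
      As a toy illustration of that starvation (not the operator's actual code; the member names and probe stubs below are made up), serial probing against one shared deadline leaves whichever member comes after the slow one with essentially no time budget:

      package main

      import (
          "context"
          "fmt"
          "time"
      )

      func main() {
          // One shared budget for the whole round (the real code's 30s, scaled
          // down to 3s here), consumed member by member in serial order.
          ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
          defer cancel()

          probes := []struct {
              name string
              run  func(context.Context) error
          }{
              // MemberA answers immediately; a real client call would return nil.
              {"MemberA", func(ctx context.Context) error { return ctx.Err() }},
              // MemberB hangs until the shared deadline expires.
              {"MemberB", func(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }},
              // MemberC is perfectly healthy, but by the time we reach it the shared
              // context is already spent, so its probe "fails" in microseconds.
              {"MemberC", func(ctx context.Context) error { return ctx.Err() }},
          }

          for _, p := range probes {
              start := time.Now()
              err := p.run(ctx)
              fmt.Printf("%s healthy=%v took=%v err=%v\n", p.name, err == nil, time.Since(start), err)
          }
      }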
      I suspect that the "was really short" Took:27.199µs got a "took too long" context deadline exceeded because GetMemberHealth has a 30s timeout per member, while many (all?) of its callers have a 30s DefaultClientTimeout, which means that by the time we get to MemberC, we've already spent our Context and are starved of the time we need to actually check MemberC. It may be more reliable to refactor and probe all known members in parallel, and to keep re-probing on failure while waiting for the slowest member-probe to come back, because I suspect a re-probe of MemberC (or even a single probe granted reasonable time to complete) while we waited on MemberB would have succeeded and told us MemberC was actually fine.
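
      A minimal sketch of that parallel shape (again illustrative only: memberHealth and the probe stubs are stand-ins, not the operator's real GetMemberHealth API), giving each member its own full timeout so a slow member only delays its own result:

      package main

      import (
          "context"
          "fmt"
          "sync"
          "time"
      )

      // memberHealth is a stand-in for the per-member result the operator reports.
      type memberHealth struct {
          Name    string
          Healthy bool
          Took    time.Duration
          Err     error
      }

      // getMemberHealthParallel probes every member concurrently, each with its
      // own per-member timeout, so one slow member cannot starve the others.
      func getMemberHealthParallel(parent context.Context, probes map[string]func(context.Context) error, perMemberTimeout time.Duration) []memberHealth {
          var (
              mu      sync.Mutex
              results []memberHealth
              wg      sync.WaitGroup
          )
          for name, probe := range probes {
              wg.Add(1)
              go func(name string, probe func(context.Context) error) {
                  defer wg.Done()
                  ctx, cancel := context.WithTimeout(parent, perMemberTimeout)
                  defer cancel()
                  start := time.Now()
                  err := probe(ctx)
                  mu.Lock()
                  results = append(results, memberHealth{Name: name, Healthy: err == nil, Took: time.Since(start), Err: err})
                  mu.Unlock()
              }(name, probe)
          }
          wg.Wait()
          return results
      }

      func main() {
          probes := map[string]func(context.Context) error{
              "MemberA": func(ctx context.Context) error { return ctx.Err() },               // fast and healthy
              "MemberB": func(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }, // hangs until its own deadline
              "MemberC": func(ctx context.Context) error { return ctx.Err() },               // healthy, and now gets a full budget
          }
          for _, h := range getMemberHealthParallel(context.Background(), probes, 3*time.Second) {
              fmt.Printf("%s healthy=%v took=%v err=%v\n", h.Name, h.Healthy, h.Took, h.Err)
          }
      }

      Whether to also re-probe failed members while waiting, as suggested above, is a separate choice; the point of the sketch is just that MemberC's check no longer inherits a deadline MemberB already burned.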

      Exposure is manageable, because this is self-healing and quorum is actually ok. But it is still worth fixing, because it spooks admins (and the origin CI test suite) when you tell them you're Available=False, and we want to save that for situations where the component is actually in trouble, like quorum loss, rather than burning signal-to-noise by claiming EtcdMembers_NoQuorum when it's really BriefIssuesScrapingMemberAHealthAndWeWillTryAgainSoon.

      Version-Release number of selected component

      Seen in 4.16.1, but the code is old, so likely a longstanding issue.

      How reproducible

      Luckily for customers, but unluckily for QE, network (or similar) hiccups when connecting to members seem rare, so we don't often trip the condition that exposes this issue.

      Steps to Reproduce

      1. Figure out which order etcd is probing members in.
      2. Stop the first or second member, in a way that makes its health probes time out ~30s.
      3. Monitor the etcd ClusterOperator Available condition.

      Actual results

      Available goes False claiming EtcdMembers_NoQuorum, as the operator starves itself of the time it needs to actually probe the third member.

      Expected results

      Available stays True, as the etcd operator takes the full 30s to check on all members and sees that two of them are completely happy.

            rhn-coreos-htariq Haseeb Tariq
            openshift-crt-jira-prow OpenShift Prow Bot
            Ge Liu Ge Liu