Description of problem
Seen in a 4.16.1 CI run:
[bz-Etcd] clusteroperator/etcd should not change condition/Available (1h28m39s)

2 unexpected clusteroperator state transitions during e2e test run. These did not match any known exceptions, so they cause this test-case to fail:

Jun 27 14:17:18.966 E clusteroperator/etcd condition/Available reason/EtcdMembers_NoQuorum status/False EtcdMembersAvailable: 1 of 3 members are available, ip-10-0-71-113.us-west-1.compute.internal is unhealthy, ip-10-0-58-93.us-west-1.compute.internal is unhealthy
Jun 27 14:17:18.966 - 75s E clusteroperator/etcd condition/Available reason/EtcdMembers_NoQuorum status/False EtcdMembersAvailable: 1 of 3 members are available, ip-10-0-71-113.us-west-1.compute.internal is unhealthy, ip-10-0-58-93.us-west-1.compute.internal is unhealthy
But further digging turned up no sign that quorum had actually been in any trouble. The problem seems to be the structure of GetMemberHealth, which currently allows timelines like:
- T0, start probing all known members in GetMemberHealth
- Tsmall, MemberA Healthy:true Took:41.614949ms Error:<nil>
- Talmost-30s, MemberB Healthy:false Took:29.869420582s Error:health check failed: context deadline exceeded
- T30s, DefaultClientTimeout runs out.
- T30s, MemberC Healthy:false Took:27.199µs Error:health check failed: context deadline exceeded
- TB, next probe round rolls around, start probing all known members in GetMemberHealth.
- TBsmall, MemberA Healthy:true Took:...ms Error:<nil>
- TB+30s, MemberB Healthy:false Took:29....s Error:health check failed: context deadline exceeded
- TB+30s, DefaultClientTimeout runs out.
- TB+30s, MemberC Healthy:false Took:...µs Error:health check failed: context deadline exceeded
That can leave gaps of 30+ seconds where MemberC is nominally Healthy:false, when in fact MemberC was completely fine.
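A minimal, scaled-down sketch of that failure mode, assuming sequential probes that all draw on one shared caller context (the names, probe helper, and timings below are illustrative, not the operator's actual code):

package main

import (
	"context"
	"fmt"
	"time"
)

// probe simulates one member health check that answers after `latency`,
// unless the shared context expires first.
func probe(ctx context.Context, name string, latency time.Duration) {
	start := time.Now()
	select {
	case <-time.After(latency):
		fmt.Printf("%s Healthy:true Took:%s Error:<nil>\n", name, time.Since(start))
	case <-ctx.Done():
		fmt.Printf("%s Healthy:false Took:%s Error:health check failed: %v\n", name, time.Since(start), ctx.Err())
	}
}

func main() {
	// The caller's DefaultClientTimeout-style budget, shared by every probe
	// (scaled down from 30s to keep the example quick).
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	probe(ctx, "MemberA", 40*time.Millisecond) // answers quickly
	probe(ctx, "MemberB", time.Hour)           // hangs and burns the rest of the shared budget
	probe(ctx, "MemberC", 40*time.Millisecond) // would answer quickly, but the context is already spent
}

As in the timeline above, MemberC ends up reported unhealthy after mere microseconds with a context-deadline error, even though it never got a real probe.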
I suspect that the "was really short" Took:27.199µs got a "took too long" context deadline exceeded because GetMemberHealth uses a 30s timeout per member, while many (all?) of its callers pass in a Context bounded by the 30s DefaultClientTimeout. By the time we get to MemberC, we have already spent that Context and are starved of time to actually check MemberC. It may be more reliable to refactor and probe all known members in parallel, and to keep re-probing on failure while waiting for the slowest member probe to return, because I suspect a re-probe of MemberC (or even a single probe granted reasonable time to complete) while we waited on MemberB would have succeeded and told us MemberC was actually fine.
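A rough sketch of that parallel approach, under the assumption of a per-probe timeout and a placeholder checkOne helper (the Member and healthResult types are illustrative, not the operator's real API):

package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// Member and healthResult stand in for the operator's member and health types.
type Member struct {
	Name string
}

type healthResult struct {
	Member  Member
	Healthy bool
	Took    time.Duration
	Err     error
}

func checkOne(ctx context.Context, m Member) error {
	// A real implementation would call the member's health endpoint; here we
	// just pretend every member answers in 50ms unless its context expires.
	select {
	case <-time.After(50 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return fmt.Errorf("health check failed: %w", ctx.Err())
	}
}

// getMemberHealthParallel probes every member concurrently, each with its own
// timeout derived from the parent context, so time spent waiting on one
// unhealthy member no longer eats the budget of the members after it.
func getMemberHealthParallel(ctx context.Context, members []Member, perProbeTimeout time.Duration) []healthResult {
	results := make([]healthResult, len(members))
	var wg sync.WaitGroup
	for i, m := range members {
		wg.Add(1)
		go func(i int, m Member) {
			defer wg.Done()
			probeCtx, cancel := context.WithTimeout(ctx, perProbeTimeout)
			defer cancel()
			start := time.Now()
			err := checkOne(probeCtx, m)
			results[i] = healthResult{Member: m, Healthy: err == nil, Took: time.Since(start), Err: err}
		}(i, m)
	}
	wg.Wait()
	return results
}

func main() {
	members := []Member{{Name: "MemberA"}, {Name: "MemberB"}, {Name: "MemberC"}}
	for _, r := range getMemberHealthParallel(context.Background(), members, 30*time.Second) {
		fmt.Printf("%s Healthy:%t Took:%s Error:%v\n", r.Member.Name, r.Healthy, r.Took, r.Err)
	}
}

Because each probe gets its own deadline derived from the parent context, a hung MemberB can no longer starve MemberC, and the result for MemberC reflects an actual probe rather than an exhausted context.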
Exposure is manageable, because this is self-healing and quorum is actually fine. But it is still worth fixing, because it spooks admins (and the origin CI test suite) when you tell them you're Available=False. We want to save that signal for situations where the component is actually in trouble, like quorum loss, and not burn signal-to-noise by claiming EtcdMembers_NoQuorum when the reality is more like BriefIssuesScrapingMemberAHealthAndWeWillTryAgainSoon.
Version-Release number of selected component
Seen in 4.16.1, but the code is old, so likely a longstanding issue.
How reproducible
Luckily for customers, but unluckily for QE, network (or similar) hiccups when connecting to members seem rare, so we rarely trip the condition that exposes this issue.
Steps to Reproduce
1. Figure out which order etcd is probing members in.
2. Stop the first or second member in a way that makes its health probes hang for ~30s before timing out.
3. Monitor the etcd ClusterOperator Available condition (see the polling sketch after these steps).
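For step 3, `oc get clusteroperator etcd -o yaml` in a loop is enough; the following is a hedged Go sketch of the same polling using the OpenShift config client, with an arbitrarily chosen poll interval:

package main

import (
	"context"
	"fmt"
	"time"

	configv1 "github.com/openshift/api/config/v1"
	configclient "github.com/openshift/client-go/config/clientset/versioned"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load the usual kubeconfig and build an OpenShift config client.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := configclient.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	for {
		co, err := client.ConfigV1().ClusterOperators().Get(context.TODO(), "etcd", metav1.GetOptions{})
		if err != nil {
			fmt.Println("get clusteroperator/etcd:", err)
		} else {
			for _, c := range co.Status.Conditions {
				if c.Type == configv1.OperatorAvailable {
					fmt.Printf("%s Available=%s reason=%s message=%q\n",
						time.Now().Format(time.RFC3339), c.Status, c.Reason, c.Message)
				}
			}
		}
		time.Sleep(10 * time.Second) // arbitrary poll interval
	}
}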
Actual results
Available goes False claiming EtcdMembers_NoQuorum, as the operator starves itself of the time it needs to actually probe the third member.
Expected results
Available stays True, as the etcd operator take the full 30s to check on all members, and see that two of them are completely happy.
- blocks: OCPBUGS-36489 [4.16] Should run health checks in parallel to avoid spurious Available=False EtcdMembers_NoQuorum claims (Closed)
- is cloned by: OCPBUGS-36489 [4.16] Should run health checks in parallel to avoid spurious Available=False EtcdMembers_NoQuorum claims (Closed)
- is related to: OCPBUGS-36462 control-plane-machine-set goes Available=False with UnavailableReplicas during etcd scale testing (Closed)
- relates to: OCPBUGS-27959 Panic: send on closed channel (Closed)
- links to: RHEA-2024:3718 OpenShift Container Platform 4.17.z bug fix update