OCPBUGS-36489

[4.16] Should run health checks in parallel to avoid spurious Available=False EtcdMembers_NoQuorum claims


    • Bug
    • Resolution: Done
    • Major
    • 4.16.z
    • 4.12, 4.13, 4.14, 4.15, 4.16, 4.17
    • Etcd
    • None
    • Moderate
    • No
    • 1
    • ETCD Sprint 256, ETCD Sprint 257
    • 2
    • Proposed
    • False
    • Previous versions of the etcd operator checked the health of etcd members in serial, with an all-member timeout that matched the single-member timeout. That allowed one slow member check to consume the entire timeout and cause later member checks to fail on deadline-exceeded, regardless of the health of those later members. Now etcd checks the health of members in parallel, so the health and speed of one member's check doesn't affect the other members' checks.
    • Bug Fix
    • In Progress

      This is a clone of issue OCPBUGS-36301. The following is the description of the original issue:

      Description of problem

      Seen in a 4.16.1 CI run:

      : [bz-Etcd] clusteroperator/etcd should not change condition/Available	1h28m39s
      {  2 unexpected clusteroperator state transitions during e2e test run.  These did not match any known exceptions, so they cause this test-case to fail:
      
      Jun 27 14:17:18.966 E clusteroperator/etcd condition/Available reason/EtcdMembers_NoQuorum status/False EtcdMembersAvailable: 1 of 3 members are available, ip-10-0-71-113.us-west-1.compute.internal is unhealthy, ip-10-0-58-93.us-west-1.compute.internal is unhealthy
      Jun 27 14:17:18.966 - 75s   E clusteroperator/etcd condition/Available reason/EtcdMembers_NoQuorum status/False EtcdMembersAvailable: 1 of 3 members are available, ip-10-0-71-113.us-west-1.compute.internal is unhealthy, ip-10-0-58-93.us-west-1.compute.internal is unhealthy
      

      But further digging turned up no sign that quorum had had any difficulties. The problem seems to be the structure of GetMemberHealth, which currently allows timelines like:

      • T0, start probing all known members in GetMemberHealth
      • Tsmall, MemberA Healthy:true Took:41.614949ms Error:<nil>
      • Talmost-30s, MemberB Healthy:false Took:29.869420582s Error:health check failed: context deadline exceeded
      • T30s, DefaultClientTimeout runs out.
      • T30s, MemberC Healthy:false Took:27.199µs Error:health check failed: context deadline exceeded
      • TB, next probe round rolls around, start probing all known members in GetMemberHealth.
      • TBsmall, MemberA Healthy:true Took:...ms Error:<nil>
      • TB+30s, MemberB Healthy:false Took:29....s Error:health check failed: context deadline exceeded
      • TB+30s, DefaultClientTimeout runs out.
      • TB+30s, MemberC Healthy:false Took:...µs Error:health check failed: context deadline exceeded

      That can leave 30+s gaps of nominal Healthy:false for MemberC when in fact MemberC was completely fine.
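
      As a toy illustration of that starvation (not the operator's actual code; the member names and probe stubs below are made up), serial probing against one shared deadline leaves whichever member comes after the slow one with essentially no time budget:

      package main

      import (
          "context"
          "fmt"
          "time"
      )

      func main() {
          // One shared budget for the whole round (the real code's 30s, scaled
          // down to 3s here), consumed member by member in serial order.
          ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
          defer cancel()

          probes := []struct {
              name string
              run  func(context.Context) error
          }{
              // MemberA answers immediately; a real client call would return nil.
              {"MemberA", func(ctx context.Context) error { return ctx.Err() }},
              // MemberB hangs until the shared deadline expires.
              {"MemberB", func(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }},
              // MemberC is perfectly healthy, but by the time we reach it the shared
              // context is already spent, so its probe "fails" in microseconds.
              {"MemberC", func(ctx context.Context) error { return ctx.Err() }},
          }

          for _, p := range probes {
              start := time.Now()
              err := p.run(ctx)
              fmt.Printf("%s healthy=%v took=%v err=%v\n", p.name, err == nil, time.Since(start), err)
          }
      }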
      I suspect that the "was really short" Took:27.199µs got a "took too long" context deadline exceeded because GetMemberHealth has a 30s timeout per member, while many (all?) of its callers have a 30s DefaultClientTimeout, which means that by the time we get to MemberC, we've already spent our Context and are starved of the time we need to actually check MemberC. It may be more reliable to refactor and probe all known members in parallel, and to keep re-probing on failure while waiting for the slowest member-probe to come back, because I suspect a re-probe of MemberC (or even a single probe granted reasonable time to complete) while we waited on MemberB would have succeeded and told us MemberC was actually fine.
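
      A minimal sketch of that parallel shape (again illustrative only: memberHealth and the probe stubs are stand-ins, not the operator's real GetMemberHealth API), giving each member its own full timeout so a slow member only delays its own result:

      package main

      import (
          "context"
          "fmt"
          "sync"
          "time"
      )

      // memberHealth is a stand-in for the per-member result the operator reports.
      type memberHealth struct {
          Name    string
          Healthy bool
          Took    time.Duration
          Err     error
      }

      // getMemberHealthParallel probes every member concurrently, each with its
      // own per-member timeout, so one slow member cannot starve the others.
      func getMemberHealthParallel(parent context.Context, probes map[string]func(context.Context) error, perMemberTimeout time.Duration) []memberHealth {
          var (
              mu      sync.Mutex
              results []memberHealth
              wg      sync.WaitGroup
          )
          for name, probe := range probes {
              wg.Add(1)
              go func(name string, probe func(context.Context) error) {
                  defer wg.Done()
                  ctx, cancel := context.WithTimeout(parent, perMemberTimeout)
                  defer cancel()
                  start := time.Now()
                  err := probe(ctx)
                  mu.Lock()
                  results = append(results, memberHealth{Name: name, Healthy: err == nil, Took: time.Since(start), Err: err})
                  mu.Unlock()
              }(name, probe)
          }
          wg.Wait()
          return results
      }

      func main() {
          probes := map[string]func(context.Context) error{
              "MemberA": func(ctx context.Context) error { return ctx.Err() },               // fast and healthy
              "MemberB": func(ctx context.Context) error { <-ctx.Done(); return ctx.Err() }, // hangs until its own deadline
              "MemberC": func(ctx context.Context) error { return ctx.Err() },               // healthy, and now gets a full budget
          }
          for _, h := range getMemberHealthParallel(context.Background(), probes, 3*time.Second) {
              fmt.Printf("%s healthy=%v took=%v err=%v\n", h.Name, h.Healthy, h.Took, h.Err)
          }
      }

      Whether to also re-probe failed members while waiting, as suggested above, is a separate choice; the point of the sketch is just that MemberC's check no longer inherits a deadline MemberB already burned.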

      Exposure is manageable, because this is self-healing and quorum is actually ok. But it is still worth fixing, because it spooks admins (and the origin CI test suite) when you tell them you're Available=False, and we want to save that for situations where the component is actually in trouble, like quorum loss, rather than burning signal-to-noise by claiming EtcdMembers_NoQuorum when it's really BriefIssuesScrapingMemberAHealthAndWeWillTryAgainSoon.

      Version-Release number of selected component

      Seen in 4.16.1, but the code is old, so likely a longstanding issue.

      How reproducible

      Luckily for customers, but unluckily for QE, network (or similar) hiccups when connecting to members seem rare, so we don't often trip the condition that exposes this issue.

      Steps to Reproduce

      1. Figure out which order etcd is probing members in.
      2. Stop the first or second member, in a way that makes its health probes time out ~30s.
      3. Monitor the etcd ClusterOperator Available condition.

      Actual results

      Available goes False claiming EtcdMembers_NoQuorum, as the operator starves itself of the time it needs to actually probe the third member.

      Expected results

      Available stays True, as the etcd operator takes the full 30s to check on all members and sees that two of them are completely happy.

            rhn-coreos-htariq Haseeb Tariq
            openshift-crt-jira-prow OpenShift Prow Bot
            Ge Liu Ge Liu