Loading...

XML

Word

Printable

Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.14.z
Affects Version/s: 4.14.z, 4.15.z, 4.16.0
Component/s: Etcd
Labels:
- pre-merge-tested

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Story Points:
None
Severity:
Important
Regression:
No

Target Backport Versions:
None
Target Version:

4.14.z
Release Blocker:
Proposed
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Test Coverage:

+

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

This is a clone of issue ~~OCPBUGS-30915~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-30873~~. The following is the description of the original issue:
—
Description of problem:

From a test run in [1] we can't be sure whether the call to etcd was really deadlocked or just waiting for a result.

Currently the CheckingSyncWrapper only defines "alive" as a sync func that has not returned an error. This can be wrong in scenarios where a member is down and perpetually not reachable. 
Instead, we wanted to detect deadlock situations where the sync loop is just stuck for a prolonged period of time.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/

Version-Release number of selected component (if applicable):

>4.14

How reproducible:

Always

Steps to Reproduce:

    1. create a healthy cluster
    2. make sure one etcd member never responds, but the node is still there (ie kubelet shutdown, blocking the etcd ports on a firewall)
    3. wait for the CEO to restart pod on failing health probe and dump its stack

Actual results:

CEO controllers are returning errors, but might not deadlock, which currently results in a restart

Expected results:

CEO should mark the member as unhealthy and continue its service without getting deadlocked and should not restart its pod by failing the health probe

Additional info:

highly related to ~~OCPBUGS-30169~~

clones

OCPBUGS-30915 [4.15] CEO aliveness check should only detect deadlocks

Closed

is blocked by

OCPBUGS-30915 [4.15] CEO aliveness check should only detect deadlocks

Closed

links to

openshift/cluster-etcd-operator#1231: [release-4.14] OCPBUGS-31428: CEO aliveness check should only detect deadlocks

RHBA-2024:1765 OpenShift Container Platform 4.14.z bug fix update

Assignee:: Dean West

Reporter:: OpenShift Prow Bot

Need Info From:: None

Contributors:: None

QA Contact:: Ge Liu

Doc Contact:: None

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Created:: 2024/03/27 5:10 AM

Updated:: 2025/07/23 11:25 AM

Resolved:: 2024/04/18 11:59 AM

Details

Description

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates