Type: Bug
Resolution: Done-Errata
Priority: Major
Fix Version/s: 4.15.z
Affects Version/s: 4.14.z, 4.15.z, 4.16.0
Component/s: Etcd
Labels:
None

Severity:
Important
Regression:
No
Release Blocker:
Proposed
Blocked:
False
Blocked Reason:
None
Target Version:

4.15.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue ~~OCPBUGS-30873~~. The following is the description of the original issue:
—
Description of problem:

From a test run in [1] we can't be sure whether the call to etcd was really deadlocked or just waiting for a result.

Currently, the CheckingSyncWrapper treats a controller as "alive" only if its sync func has not returned an error. This is misleading when a member is down and permanently unreachable: the sync func keeps returning errors even though the sync loop itself is still running.
Instead, the check should detect deadlocks, that is, situations where the sync loop is stuck for a prolonged period of time.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/

Version-Release number of selected component (if applicable):

>4.14

How reproducible:

Always

Steps to Reproduce:

    1. Create a healthy cluster.
    2. Make sure one etcd member never responds while its node is still present (e.g. shut down the kubelet or block the etcd ports with a firewall).
    3. Wait for the CEO to restart its pod on the failing health probe and dump its stack.

Actual results:

CEO controllers return errors but are not necessarily deadlocked; the failing health probe nevertheless causes a pod restart.

Expected results:

CEO should mark the member as unhealthy and keep running without deadlocking. It should not fail the health probe, so its pod should not be restarted.

Additional info:

Closely related to ~~OCPBUGS-30169~~

blocks

OCPBUGS-31428 [4.14] CEO aliveness check should only detect deadlocks

Closed

clones

OCPBUGS-30873 CEO aliveness check should only detect deadlocks

Closed

is blocked by

OCPBUGS-30873 CEO aliveness check should only detect deadlocks

Closed

is cloned by

OCPBUGS-31428 [4.14] CEO aliveness check should only detect deadlocks

Closed

links to

openshift/cluster-etcd-operator#1225: [release-4.15] OCPBUGS-30915: CEO aliveness check should only detect deadlocks

RHBA-2024:1559 OpenShift Container Platform 4.15.z bug fix update


Assignee: Dean West

Reporter: OpenShift Prow Bot

QA Contact: Ge Liu

Votes: 0

Watchers: 6

Created: 2024/03/14 8:23 AM

Updated: 2024/09/09 11:00 AM

Resolved: 2024/04/02 7:33 PM
