  OpenShift Bugs / OCPBUGS-30873

CEO aliveness check should only detect deadlocks


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Major
    • 4.16.0
    • 4.14.z, 4.15.z, 4.16.0
    • Etcd
    • None
    • Important
    • No
    • Rejected
    • False
    • Previously, the etcd Cluster Operator wrongly identified non-running controllers as deadlocked, which caused an unnecessary pod restart. With this release, the Operator marks a non-running controller as an unhealthy etcd member without restarting a pod. (link:https://issues.redhat.com/browse/OCPBUGS-30873[*OCPBUGS-30873*])
    • Bug Fix
    • Done

      Description of problem:

      From a test run in [1] we can't be sure whether the call to etcd was really deadlocked or just waiting for a result.
      
      Currently, the CheckingSyncWrapper only defines "alive" as a sync func whose last run did not return an error. That definition is wrong when a member is down and perpetually unreachable: the sync loop keeps returning errors even though it is not stuck.
      Instead, we want to detect genuine deadlocks, i.e. a sync loop that has stopped returning for a prolonged period of time (a sketch of this follows below).
      
      [1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.16-upgrade-from-stable-4.15-e2e-metal-ipi-upgrade-ovn-ipv6/1762965898773139456/
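
      A minimal sketch (Go) of the intended behaviour, assuming a wrapper similar to the CheckingSyncWrapper: "alive" means "no single sync call has been stuck for longer than a deadlock timeout", independent of whether syncs return errors. The names (SyncFunc, deadlockTimeout, Alive) and the in-flight bookkeeping are illustrative, not the actual cluster-etcd-operator code, and it assumes sync calls are serialized per controller.

      package health

      import (
          "context"
          "sync/atomic"
          "time"
      )

      // SyncFunc is a simplified controller sync signature for this sketch.
      type SyncFunc func(ctx context.Context) error

      // CheckingSyncWrapper tracks how long the current sync call has been
      // in flight. A sync that keeps returning errors (e.g. because one etcd
      // member is unreachable) is still "alive"; only a sync that does not
      // return for longer than deadlockTimeout is treated as deadlocked.
      type CheckingSyncWrapper struct {
          syncFn          SyncFunc
          deadlockTimeout time.Duration
          inFlightSince   int64 // unix nanos of the current sync start; 0 = no sync running
      }

      func NewCheckingSyncWrapper(syncFn SyncFunc, deadlockTimeout time.Duration) *CheckingSyncWrapper {
          return &CheckingSyncWrapper{syncFn: syncFn, deadlockTimeout: deadlockTimeout}
      }

      // Sync stamps the start time, delegates to the wrapped sync func and
      // clears the in-flight marker when the call returns, error or not.
      func (c *CheckingSyncWrapper) Sync(ctx context.Context) error {
          atomic.StoreInt64(&c.inFlightSince, time.Now().UnixNano())
          defer atomic.StoreInt64(&c.inFlightSince, 0)
          return c.syncFn(ctx)
      }

      // Alive reports false only while a sync call has been stuck for longer
      // than deadlockTimeout; idle or erroring controllers stay alive.
      func (c *CheckingSyncWrapper) Alive() bool {
          started := atomic.LoadInt64(&c.inFlightSince)
          if started == 0 {
              return true
          }
          return time.Since(time.Unix(0, started)) <= c.deadlockTimeout
      }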

      Version-Release number of selected component (if applicable):

      >4.14

      How reproducible:

      Always

      Steps to Reproduce:

          1. create a healthy cluster
          2. make sure one etcd member never responds while its node stays up (e.g. shut down the kubelet or block the etcd ports with a firewall)
          3. wait for the CEO to restart its pod on a failing health probe and dump its stack
          

      Actual results:

      CEO controllers keep returning errors but are not necessarily deadlocked, yet this currently results in a pod restart.

      Expected results:

      The CEO should mark the member as unhealthy and continue serving; it should not fail its health probe and restart its pod unless a controller is actually deadlocked.
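
      For illustration, a hedged sketch (same assumptions as above) of how the pod's liveness endpoint could aggregate the per-controller Alive() checks so that sync errors alone never fail the probe; the MultiAlivenessChecker name and the HTTP wiring are assumptions, not the actual operator code.

      package health

      import (
          "net/http"
          "sync"
      )

      // AliveChecker is satisfied by the CheckingSyncWrapper sketched above.
      type AliveChecker interface {
          Alive() bool
      }

      // MultiAlivenessChecker aggregates every wrapped controller for the
      // liveness probe (illustrative name and wiring).
      type MultiAlivenessChecker struct {
          mu       sync.Mutex
          checkers map[string]AliveChecker
      }

      func (m *MultiAlivenessChecker) Add(name string, c AliveChecker) {
          m.mu.Lock()
          defer m.mu.Unlock()
          if m.checkers == nil {
              m.checkers = map[string]AliveChecker{}
          }
          m.checkers[name] = c
      }

      // ServeHTTP fails the probe (HTTP 500) only when at least one controller
      // appears deadlocked; controllers that merely report an unhealthy etcd
      // member keep returning 200, so the kubelet does not restart the pod.
      func (m *MultiAlivenessChecker) ServeHTTP(w http.ResponseWriter, r *http.Request) {
          m.mu.Lock()
          defer m.mu.Unlock()
          for name, c := range m.checkers {
              if !c.Alive() {
                  http.Error(w, "controller "+name+" appears deadlocked", http.StatusInternalServerError)
                  return
              }
          }
          w.WriteHeader(http.StatusOK)
      }

      Wired into the pod spec as an httpGet liveness probe, this keeps a member outage visible as unhealthy-member status rather than triggering an operator pod restart.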

      Additional info:

      Highly related to OCPBUGS-30169.

              dwest@redhat.com Dean West
              tjungblu@redhat.com Thomas Jungblut
              Ge Liu