Type: Bug
Resolution: Done
Priority: Minor
Fix Version/s: 4.14.0
Affects Version/s: 4.14
Component/s: Etcd
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
1
Severity:
Low
Regression:
No

Target Backport Versions:
None
Target Version:

4.14.0
Release Blocker:
None
Sprint:
ETCD Sprint 238
sprint_count:
1

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
Done
Release Note Type:
Bug Fix
Release Note Text:
N/A

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

We have already had several forum cases and bugs where restarting the CEO (cluster-etcd-operator) fixed the problem. A liveness probe could trigger that restart automatically.

We previously traced these cases to stuck or deadlocked controllers, missing timeouts in gRPC calls, and other causes we have not yet identified. Because the list of possible failures is large, we should add a liveness probe to the CEO that periodically runs a health check:

all controllers have completed a sync at least once in the last 5/10 minutes
on failure, produce a goroutine dump to analyse what went wrong

This check should not indicate whether the etcd cluster itself is healthy; it covers only the health of the CEO.

clones

OCPBUGS-11683 [4.13] Add Controller health to CEO liveness probe

Closed

relates to

ETCD-410 Add Controller health to CEO liveness probe

Closed

links to

openshift/cluster-etcd-operator#1049: OCPBUGS-14255: Add Controller health to CEO liveness probe

Assignee: Thomas Jungblut

Reporter: Thomas Jungblut

QA Contact: Ge Liu

Need Info From: None

Votes: 0

Watchers: 4

Created: 2023/05/30 9:16 AM

Updated: 2025/07/26 11:55 AM

Resolved: 2023/07/06 8:54 AM
