[OCPSTRAT-454] improve etcd dashboard, alerts & metrics - Red Hat Issue Tracker

Type: Feature
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: Core
Labels:
- FPC:TODO-Close-ALL-Epics
- etcd

Work Type:
BU Product Work
Blocked:
False
Blocked Reason:

None
Ready:
False
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Target Version:

openshift-4.14

Risk Score:
0

Feature Overview (aka. Goal Summary)

Many factors influence the stability and health of an etcd cluster. To improve visibility for customers who experience issues with etcd or use non-standard deployments (e.g., stretched control planes), we should provide additional metrics that make failures easier to troubleshoot and analyze.

Requirements (aka. Acceptance Criteria):

Update the dashboard and alerts to show which node is failing. The current dashboard and alert only report the number of members that are up or how many have no leader. This is useful, but it does not identify the failing node, which the cluster admin needs in order to target troubleshooting efforts.
The dashboard presents "peer round trip time" and, in a separate area, the disk fsync latency ("Disk Sync Duration"). Add the following:
- Include the jitter or stddev of the disk fsync duration. This would help identify problems caused by storage provided by external storage arrays, which can deliver inconsistent IOPS and throughput when the storage network is saturated. It also helps identify storage problems introduced by virtualization layers, which can become saturated at certain times of day or under certain platform conditions.
- Include a network jitter plot as perceived by etcd, defined as etcd network jitter = max(etcd RTT) - min(etcd RTT).
- Include an end-to-end etcd latency plot (like a bullet chart https://www.patternfly.org/v4/charts/bullet-chart) composed of disk latency + disk jitter + network latency + network jitter, with a maximum threshold equal to the system ETCD_HEARTBEAT_INTERVAL.
  - The same graph should visually segment good, warning, and danger zones for the total composed RTT. For example, good for anything below 50%, warning for 51-80%, and danger for anything above 80% of ETCD_HEARTBEAT_INTERVAL.
  - An alert should be generated for the warning and danger zones.
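
The arithmetic behind the jitter and end-to-end latency requirements above can be sketched as follows. This is only an illustration, not the dashboard implementation: the function names are hypothetical, the "latency" terms are taken as the mean of the samples (the ticket does not say which statistic to use), the heartbeat interval is assumed to be etcd's default of 100 ms, and a total of exactly 50% is treated as a warning because the ticket leaves that boundary unspecified.

```python
from statistics import mean, pstdev

# Assumption: etcd's default heartbeat interval (100 ms).
# Use the value of ETCD_HEARTBEAT_INTERVAL configured on the cluster.
HEARTBEAT_INTERVAL_MS = 100.0


def disk_jitter(fsync_ms):
    """Disk jitter as the population standard deviation of fsync durations."""
    return pstdev(fsync_ms)


def network_jitter(rtt_ms):
    """Network jitter as defined in the ticket: max(RTT) - min(RTT)."""
    return max(rtt_ms) - min(rtt_ms)


def end_to_end_latency(fsync_ms, rtt_ms):
    """Disk latency + disk jitter + network latency + network jitter.

    Assumption: "latency" is the mean of the samples in the window.
    """
    return (
        mean(fsync_ms)
        + disk_jitter(fsync_ms)
        + mean(rtt_ms)
        + network_jitter(rtt_ms)
    )


def classify(total_ms, heartbeat_ms=HEARTBEAT_INTERVAL_MS):
    """Map the composed latency onto the good / warning / danger zones.

    Zones follow the ticket's example: below 50% is good, up to 80% is
    warning, above 80% is danger. Exactly 50% is treated as warning
    (an assumption; the ticket does not define that boundary).
    """
    ratio = total_ms / heartbeat_ms
    if ratio > 0.8:
        return "danger"
    if ratio >= 0.5:
        return "warning"
    return "good"


# Hypothetical samples in milliseconds for one etcd member.
fsync_samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
rtt_samples = [10.0, 12.0, 15.0]

total = end_to_end_latency(fsync_samples, rtt_samples)
zone = classify(total)
print(f"composed latency: {total:.1f} ms, zone: {zone}")
```

An alert rule would fire whenever `classify` returns "warning" or "danger". In a real dashboard the samples would come from etcd's own metrics over a time window rather than from in-memory lists.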

incorporates

RFE-1788 Have additional information with alert etcdMembersDown

Closed

Assignee: William Caban

Reporter: William Caban

Doc Contact: Matthew Werner

Votes: 0

Watchers: 5

Created: 2023/05/11 8:11 PM

Updated: 2024/09/04 8:49 PM

Resolved: 2024/05/15 4:46 PM
