-
Feature
-
Resolution: Done
-
Major
-
None
-
None
Feature Overview (aka. Goal Summary)
Many factors influence the stability and health of an etcd cluster. To improve visibility for customers who experience issues with etcd or who run non-standard deployments (e.g., stretched control planes), we should provide additional metrics that make failures easier to troubleshoot and analyze.
Requirements (aka. Acceptance Criteria):
- Update the dashboard and alerts to identify which node is failing. The current dashboard and alerts report only the number of members that are up or the number of members without a leader. This is useful, but it does not tell the cluster admin which node is failing, so troubleshooting efforts cannot be targeted.
- The dashboard currently presents "Peer round trip time" in one area and the disk fsync latency ("Disk Sync Duration") in another.
- Include the jitter or standard deviation (stddev) of the disk fsync duration. This would help identify problems caused by external storage arrays, which can deliver inconsistent IOPS and throughput when the storage network is saturated. It would also help identify storage problems introduced by virtualization layers, which can become saturated at certain times of day or under certain platform conditions.
- Include a plot of network jitter as perceived by etcd, defined as: etcd network jitter = max(etcd RTT) - min(etcd RTT)
- Include an end-to-end etcd latency plot (such as a bullet chart, https://www.patternfly.org/v4/charts/bullet-chart) composed of disk latency + disk jitter + network latency + network jitter, with a maximum threshold equal to the system ETCD_HEARTBEAT_INTERVAL.
- The same graph should visually segment the total composite RTT into good, warning, and danger zones. For example: good for anything below 50%, warning for 51-80%, and danger for anything above 80% of ETCD_HEARTBEAT_INTERVAL.
- An alarm should be generated for the warning and danger intervals.
- incorporates RFE-1788 Have additional information with alert etcdMembersDown (Accepted)
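The jitter and zone requirements above can be illustrated with a minimal sketch. It is not the dashboard implementation: the function names are hypothetical, the latency terms are assumed to be means over a sample window, disk jitter is assumed to be the population standard deviation, and the handling of the exact 50% and 80% boundaries is an assumption because the requirement leaves it open. The 100 ms heartbeat value in the usage lines matches etcd's documented default and should be replaced by the cluster's actual ETCD_HEARTBEAT_INTERVAL.

```python
from statistics import fmean, pstdev


def network_jitter(rtt_ms):
    """Network jitter as stated in the requirement: max(RTT) - min(RTT)."""
    return max(rtt_ms) - min(rtt_ms)


def disk_jitter(fsync_ms):
    """Disk jitter, assumed here to be the population stddev of fsync durations."""
    return pstdev(fsync_ms)


def composite_latency(fsync_ms, rtt_ms):
    """Disk latency + disk jitter + network latency + network jitter.

    Assumption: each 'latency' term is the mean of its sample window.
    """
    return (fmean(fsync_ms) + disk_jitter(fsync_ms)
            + fmean(rtt_ms) + network_jitter(rtt_ms))


def classify(total_ms, heartbeat_ms):
    """Map the composite latency to a zone relative to the heartbeat interval.

    Assumption: exactly 50% counts as good and exactly 80% as warning.
    """
    ratio = total_ms / heartbeat_ms
    if ratio <= 0.5:
        return "good"
    if ratio <= 0.8:
        return "warning"
    return "danger"


# Hypothetical sample window, all values in milliseconds.
fsync_samples = [4.0, 6.0]
rtt_samples = [2.0, 3.0, 7.0]
total = composite_latency(fsync_samples, rtt_samples)
zone = classify(total, heartbeat_ms=100.0)
```

An alert rule for the warning and danger zones would then fire whenever `classify` returns anything other than "good".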