Issue type: Epic
Resolution: Done
Priority: Critical
Status: Done
Summary: [etcd Spike] Increase the overall quality for OpenShift's OOTB alerting rules
Parent feature: OCPPLAN-6068 - Increase the overall quality for OpenShift's OOTB alerting rules
Kubernetes-native Infrastructure
Telco 5G Core
Review the following etcd critical alerts:
- etcdInsufficientMembers (etcd?)
- etcdMembersDown (etcd?)
- etcdNoLeader (etcd?)
- etcdGRPCRequestsSlow (etcd?)
- etcdHighFsyncDurations (etcd?)
- etcdBackendQuotaLowSpace (etcd?)
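As a reference point for this review, here is a minimal sketch of how one of the listed alerts, etcdMembersDown, could look once it carries everything the requirements below ask for. The expression, the 10m threshold, and the runbook location are illustrative assumptions for this sketch, not the rule as it currently ships:

- name: etcd
  rules:
  - alert: etcdMembersDown
    # Illustrative expression and threshold only; the shipped rule may differ.
    expr: count without (instance) (up{job=~".*etcd.*"} == 0) > 0
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: etcd cluster members are down.
      description: 'etcd cluster "{{ $labels.job }}": {{ $value }} members are down.'
      # Hypothetical runbook location, shown only to illustrate the annotation.
      runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/etcdMembersDown.md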
Problem statement
OpenShift Monitoring was first released with OpenShift v3.11. Alongside Prometheus and other new technologies, the Monitoring team started to ship best practices for how an administrator operates a Kubernetes cluster, in the form of out-of-the-box metrics, dashboards, and alerting rules. Since then, these prepackaged best practices have grown significantly, providing much better insights into and integrations with many OpenShift components, so that customers spend less time defining or researching what is important on their own; this increases the value of moving to OpenShift.
Now that we have a notable number of customers on OpenShift 4, we realized that we have not done any retrospective on our existing best practices, specifically on one aspect that matters a lot to our customers: alerting rules. With any "default" OpenShift installation we currently ship around 170 rules that take different forms, are filled in with different information, and mostly do not tell our customers which action to take when they receive an individual alert. Both the scale of the clusters and the targeted availability level require an operator to be able to act quickly when an OCP alert fires. But today, many customers do not know how to interpret our alerts.
Goals
We'd like to ask all teams to review their alerts and make sure they match the "Requirements" listed below. Obviously, this feature represents a larger initiative, but we want to start somewhere.
Therefore, for the 4.9 timeframe we ask teams to review the list of critical alerts above, which are part of the "default" OpenShift installation (excluding any alerts that come in from add-ons), and to create an epic to review and, if necessary, improve these rules. In that timeframe we do not explicitly ask teams to review "warning" alerts, but if you have some cycles left, please consider reviewing those as well.
Non Goals - out of scope
- Provide a ruleset expressing cluster health based on all received alerts: this feature stops at the alert level and does not add a layer above it that correlates alerts or performs root cause analysis.
- Anything outside reviewing and improving our current out-of-the-box alerting rules.
Requirements
- Review the severity of your alert and make sure that it must be a "critical" alert. Please bear in mind that critical alerts are something that require someone to get out of bed in the middle of the night and fix something right now or they will be faced with a disaster.
- Make sure you have the "summary" and "description" annotations filled out. Please see "Additional Notes" for more information.
- Don't use the "message" annotation anymore; it is a deprecated field. If you still use it, please move its content into either "summary" or "description" (see the sketch after this list).
- Add a runbook (via "runbook_url" annotation) that explains what someone should do when receiving an alert. Please see "Additional Notes" for more information.
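A minimal before/after sketch of the "message" migration, using a hypothetical ExampleOperatorDegraded rule; the metric name, threshold, and runbook URL are placeholders, not an existing OpenShift rule:

Before (deprecated "message" annotation):
  - alert: ExampleOperatorDegraded
    expr: example_operator_degraded == 1   # hypothetical metric
    for: 10m
    labels:
      severity: critical
    annotations:
      message: Example operator has been degraded for more than 10 minutes.

After (meets the requirements above):
  - alert: ExampleOperatorDegraded
    expr: example_operator_degraded == 1   # hypothetical metric
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: Example operator is degraded.
      description: The example operator has reported Degraded=True for more than 10 minutes; its operands may be unavailable or outdated.
      runbook_url: https://example.com/runbooks/ExampleOperatorDegraded.md   # placeholder URL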
Use Cases
- As an OCP operator, I only receive alerts in the middle of the night if they are really critical.
- As an OCP operator, I clearly understand the meaning, impact and call to action when I receive an alert.
Background and strategic fit
- The targeted persona is an OCP operator at scale with stringent SLAs: feedback from Red Hat Telco consultants and Solution Architects is critical before pushing the feature to our end customers.
- Quality is a huge, very important topic, and we shouldn't forget that it applies not only to our software stack but to everything we ship and that customers use. It is part of our overall experience and we should treat it accordingly.
Customer Considerations
- Feedback from one customer. TL;DR:
- Way too many alert types
- Way too many alerts per hour
- No description of the impact of the alert on cluster health
- The severity of the alert does not convey the impact on cluster health
- An example of one customer's alert variety
Additional Notes
Important annotations
summary
Essentially a more comprehensive and readable version of the alert name. Use a human-readable sentence, starting with a capital letter and ending with a period. Use a static string or, if dynamic expansion is needed, aim for expanding into the same string for alerts that are typically grouped together into one notification. That way, it can be used as a common "headline" for all alerts in the notification template. Examples: "Filesystem has less than 3% inodes left." (for the NodeFilesystemAlmostOutOfFiles alert), "Prometheus alert notification queue predicted to run full in less than 30m." (for the PrometheusNotificationQueueRunningFull alert).
In short, "summary" becomes the notification's "headline" / title. Try to be as concise as possible.
description
A detailed description of a single alert, with most of the important information templated in. The description usually expands into a different string for every individual alert within a notification. A notification template can iterate through all the descriptions and format them into a list. Examples (again corresponding to the alerts above): "Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left.", "Alert notification queue of Prometheus %(prometheusName)s is running full."
In short, "description" becomes the body of your notification. Try to elaborate a little.
runbook_url
A link to a document that elaborates in much more detail on the alert and on the action someone should take after receiving it. A runbook can live as a Markdown file in a GitHub repository. Every runbook should have at least the following sections:
- Meaning (could be the same as what you put into "description")
- Impact (explain the impact on the cluster, the infrastructure, and/or the workload)
- Diagnosis (what to do after you receive the alert and how to troubleshoot the problem)
- Mitigation (possible remediation steps)
Example
- name: node-exporter
  rules:
  - alert: NodeFilesystemSpaceFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up.
      runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemspacefillingup
      summary: Filesystem is predicted to run out of space within the next 24 hours.
Additional Resources
- relates to: RFE-1788 Have additional information with alert etcdMembersDown (Accepted)
- links to