OpenShift Top Level Product Strategy / OCPPLAN-6250

[etcd Spike] Increase the overall quality for OpenShift's OOTB alerting rules


Details

    • [etcd Spike] Increase the overall quality for OpenShift's OOTB alerting rules
    • Done
    • OCPPLAN-6068 - Increase the overall quality for OpenShift's OOTB alerting rules
    • Kubernetes-native Infrastructure
    • Telco 5G Core

    Description

      Review the following etcd critical alerts

      • etcdInsufficientMembers
      • etcdMembersDown
      • etcdNoLeader
      • etcdGRPCRequestsSlow
      • etcdHighFsyncDurations
      • etcdBackendQuotaLowSpace
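
      For orientation, the sketch below shows roughly how one of these rules, etcdNoLeader, has been defined in the upstream etcd mixin; the exact expression, thresholds and annotations in the version shipped with OpenShift may differ, so treat this as an illustration rather than the authoritative source. Note how the alert text lives in the deprecated "message" annotation rather than in "summary"/"description", which is exactly the kind of thing the review below should catch.

      - alert: etcdNoLeader
        # expression and annotation text are approximate, taken from the upstream mixin
        expr: etcd_server_has_leader{job=~".*etcd.*"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          message: 'etcd cluster "{{ $labels.job }}": member {{ $labels.instance }} has no leader.'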

      Problem statement

      OpenShift Monitoring was first released with OpenShift v3.11. Alongside Prometheus and other new technologies, the Monitoring team started to ship best practices for how an administrator operates a Kubernetes cluster, in the form of out-of-the-box metrics, dashboards, and alerting rules. Since then, the prepackaged best practices have grown significantly, providing much better insights into and integrations with many OpenShift components, so that customers spend less time defining or researching what's important on their own; this increases the value of moving to OpenShift.

      Now that we have a notable number of customers on OpenShift 4, we realized that we haven't done any retrospective on our existing best practices, specifically on one aspect that matters a lot to our customers - alerting rules. With any "default" OpenShift installation we currently ship around 170 rules in different forms, filled in with different information and mostly without telling our customers the call to action when they receive an individual alert. Both the scale of the clusters and the targeted availability level require any operator to be able to take action quickly when an OCP alert fires. But today, many customers do not know how to interpret our alerts.

      Goals

      We'd like to ask all teams to review their alerts and make sure that they match the "Requirements" listed below. Obviously, this feature represents a larger initiative, but we want to start somewhere.

      Therefore, for the 4.9 timeframe we ask teams to review this list of critical alerts that are part of the "default" OpenShift installation (excluding any alerts that come in from add-ons) and create an epic to review and improve (if necessary) these rules. In that timeframe, we do not explicitly ask you to review "warning" alerts, but if you have some cycles left, please consider reviewing those as well.
       

      Non Goals - out of scope

      • Provide a ruleset expressing cluster health based on all received alerts: this feature stops at the alert level, not at a layer above that correlates alerts and performs root-cause analysis.
      • Anything outside reviewing and improving our current out-of-the-box alerting rules.

      Requirements

      • Review the severity of your alert and make sure that it really needs to be a "critical" alert. Please bear in mind that critical alerts are alerts that require someone to get out of bed in the middle of the night and fix something right now, or they will be faced with a disaster.
      • Make sure you have the "summary" and "description" annotations filled out. Please see "Additional Notes" for more information.
      • Don't use the "message" annotation anymore - it's a deprecated field. If you use it, please make sure to move the content into either "summary" or "description".
      • Add a runbook (via "runbook_url" annotation) that explains what someone should do when receiving an alert. Please see "Additional Notes" for more information.
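
      As a concrete illustration, here is a minimal sketch of what one of the etcd rules above could look like after such a review, assuming its severity stays critical. The annotation wording and the runbook location are placeholders for illustration only, not agreed-upon content.

      - alert: etcdNoLeader
        expr: etcd_server_has_leader{job=~".*etcd.*"} == 0
        for: 1m
        labels:
          severity: critical   # reviewed and kept critical: without a leader the cluster cannot serve writes
        annotations:
          # deprecated "message" annotation replaced by "summary" + "description"
          summary: etcd cluster has no leader.
          description: 'etcd cluster "{{ $labels.job }}": member {{ $labels.instance }} has no leader. Writes to the cluster will fail until a new leader is elected.'
          # example path only, not a confirmed runbook location
          runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/etcdNoLeader.md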

      Use Cases

      • As an OCP operator, I only receive alerts in the middle of the night if they are really critical.
      • As an OCP operator, I clearly understand the meaning, impact and call to action when I receive an alert.

      Background, and strategic fit

      • The targeted persona is an OCP operator at scale, with stringent SLAs: feedback from Red Hat Telco consultants and Solution Architects is critical before pushing the feature to our end customers.
      • Quality is a huge, very important topic, and we shouldn't forget that it applies not only to our software stack but to everything we ship that customers use. It is part of our overall experience and we should treat it accordingly.

      Customer Considerations

      • One customer feedback. TL;DR:
        • Way too many alert types
        • Way too many alerts per hour
        • No description of the impact of the alert on the cluster health
        • Severity of the alert doesn't inform on the cluster health
      • One example of the variety of alerts a single customer receives

      Additional Notes

      Important annotations

      summary

      Essentially a more comprehensive and readable version of the alert name. Use a human-readable sentence, starting with a capital letter and ending with a period. Use a static string or, if dynamic expansion is needed, aim for expanding into the same string for alerts that are typically grouped together into one notification. In that way, it can be used as a common "headline" for all alerts in the notification template. Examples: Filesystem has less than 3% inodes left. (for a NodeFilesystemAlmostOutOfFiles alert), Prometheus alert notification queue predicted to run full in less than 30m. (for a PrometheusNotificationQueueRunningFull alert).

      In short, "summary" will be matched into the notifications "headline" / title. Try to be as concise as possible.

      description

      A detailed description of a single alert, with most of the important information templated in. The description usually expands into a different string for every individual alert within a notification. A notification template can iterate through all the descriptions and format them into a list. Examples (again corresponding to the examples above): Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left., Alert notification queue of Prometheus %(prometheusName)s is running full.

      In short, "description" will be matched into the body of your notification. Try to elaborate a little.

      runbook_url

      A link to a document that explains the alert in much more detail and describes what action someone should take after receiving it. A runbook can live as a Markdown file in a GitHub repository. Every runbook should have at least the following sections:

      • Meaning (could be the same as what you put into the "description")
      • Impact (explain the impact on the cluster, the infrastructure and/or the workload)
      • Diagnosis (what to do after you receive the alert and how you can troubleshoot the problem)
      • Mitigation (possible remediation steps)

      Example

      - name: node-exporter
        rules:
        - alert: NodeFilesystemSpaceFillingUp
          annotations:
            description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up.
            runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemspacefillingup
            summary: Filesystem is predicted to run out of space within the next 24 hours.
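
      For completeness, below is a minimal sketch of how such a rule group is typically delivered in OpenShift: wrapped in a PrometheusRule custom resource that the owning operator manages. The resource name, namespace and the simplified expression are illustrative assumptions, not the values actually shipped.

      apiVersion: monitoring.coreos.com/v1
      kind: PrometheusRule
      metadata:
        name: node-exporter-rules        # illustrative name, not the shipped manifest
        namespace: openshift-monitoring  # illustrative namespace
      spec:
        groups:
        - name: node-exporter
          rules:
          - alert: NodeFilesystemSpaceFillingUp
            # Simplified expression for illustration; the shipped rule also checks
            # current usage and read-only status before predicting exhaustion.
            expr: predict_linear(node_filesystem_avail_bytes{job="node-exporter",fstype!=""}[6h], 24*60*60) < 0
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: Filesystem is predicted to run out of space within the next 24 hours.
              description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up.
              runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemspacefillingup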
      
