Type: Epic
Resolution: Done
Priority: Major
Fix Version/s: 2021Q3 Plan, 2021Q4 Plan, openshift-4.10
Affects Version/s: None
Component/s: None
Labels:
- 5GC
- doc-ack
- px-ack
- qe-ack
- tc-approved

Epic Name:
[api-server]Increase the overall quality for OpenShift's OOTB alerting rules
Epic Status:
Done
Activity Type:
None
Blocked:
False
Blocked Reason:
None
Ready:
False
Size:
None

Target Version:
None
Release Blocker:
None

Problem statement

OpenShift Monitoring was first released with OpenShift v3.11. Alongside Prometheus and other new technologies, the Monitoring team started to ship best practices for operating a Kubernetes cluster in the form of out-of-the-box metrics, dashboards, and alerting rules. Since then, the prepackaged best practices have grown significantly, providing better insights into and integrations with many OpenShift components, so that customers spend less time defining or researching what’s important on their own. This increases the value of moving to OpenShift.

Now that we have a notable number of customers on OpenShift 4, we realized that we haven’t done any retrospective on our existing best practices, specifically on one aspect that matters a lot to our customers: alerting rules. A “default” OpenShift installation currently ships around 170 rules in different forms, filled in with different information and mostly without telling our customers the call to action when they receive an individual alert. Both the scale of the clusters and the targeted availability level require any Operator to be able to take action quickly when an OCP alert fires. But today, many customers do not know how to interpret our alerts.

Goals

We’d like to ask all teams to review their alerts and make sure that they match the "Requirements" listed below. This feature represents a larger initiative, but we want to start somewhere.

Therefore, for the 4.9 timeframe we ask teams to review this list of critical alerts that are part of the "default" OpenShift installation (excluding any alerts that come in from add-ons) and create an epic to review and improve (if necessary) these rules. In that timeframe, we do not explicitly ask teams to review "warning" alerts, but if you have some cycles left, please consider reviewing those as well.

Non Goals - out of scope

Provide a ruleset expressing cluster health based on all received alerts: this feature stops at alert level, not at an above layer with correlations between alerts and root cause analysis.
Anything outside reviewing and improving our current out-of-the-box alerting rules.

Requirements

Review the severity of your alert and make sure that it really needs to be a “critical” alert. Please bear in mind that critical alerts require someone to get out of bed in the middle of the night and fix something right now, or face a disaster.
Make sure you have the “summary” and “description” annotations filled out. Please see "Additional Notes" for more information.
Don’t use the “message” annotation anymore – it’s a deprecated field. If you use it, please make sure to move the content into either “summary” or “description”.
Add a runbook (via "runbook_url" annotation) that explains what someone should do when receiving an alert. Please see "Additional Notes" for more information.
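
The four requirements above can be illustrated with a before/after sketch of a single rule. This is only an illustration under stated assumptions: the alert name ExampleComponentDown, its expression, the decision to lower its severity, and the runbook URL are all hypothetical and do not refer to any rule that OpenShift ships.

```yaml
# Before the review (hypothetical rule): marked critical without a clear
# justification, uses the deprecated "message" annotation, has no runbook.
- alert: ExampleComponentDown
  expr: up{job="example-component"} == 0
  for: 10m
  labels:
    severity: critical
  annotations:
    message: Example component is down.
---
# After the review (hypothetical outcome): severity lowered because the
# team decided nobody needs to be paged at night, "message" moved into
# "summary" and "description", and a runbook linked.
- alert: ExampleComponentDown
  expr: up{job="example-component"} == 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Example component target is down.
    description: Target {{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 10 minutes.
    # Placeholder URL, not a real runbook location.
    runbook_url: https://example.com/runbooks/ExampleComponentDown.md
```

Whether a given alert stays critical or is lowered to warning is a per-alert decision for the owning team; the sketch only shows what a rule might look like once that decision and the annotation cleanup are done.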

Use Cases

As an OCP operator, I only receive alerts in the middle of the night if they are really critical.
As an OCP operator, I clearly understand the meaning, impact and call to action when I receive an alert.

Background, and strategic fit

The targeted persona is an OCP Operator at scale with stringent SLAs. Feedback from Red Hat Telco consultants and Solution Architects is critical before pushing the Feature to our end customers.
Quality is a huge, very important topic, and we shouldn’t forget that it applies not only to our software stack but to everything we ship and customers use. It is part of our overall experience and we should treat it accordingly.

Customer Considerations

One customer's feedback. TL;DR:
- Way too many alert types
- Way too many alerts per hour
- No description of the impact of the alert on the cluster health
- Severity of the alert doesn’t inform on the cluster health
One customer's alert variety example

Additional Notes

Important annotations

summary

Essentially a more comprehensive and readable version of the alert name. Use a human-readable sentence, starting with a capital letter and ending with a period. Use a static string or, if dynamic expansion is needed, aim for expanding into the same string for alerts that are typically grouped together into one notification. In that way, it can be used as a common “headline” for all alerts in the notification template. Examples: "Filesystem has less than 3% inodes left." (for the NodeFilesystemAlmostOutOfFiles alert) and "Prometheus alert notification queue predicted to run full in less than 30m." (for the PrometheusNotificationQueueRunningFull alert).

In short, "summary" will be used as the notification's "headline" / title. Try to be as concise as possible.

description

A detailed description of a single alert, with most of the important information templated in. The description usually expands into a different string for every individual alert within a notification. A notification template can iterate through all the descriptions and format them into a list. Examples (again corresponding to the examples above): Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available inodes left., Alert notification queue of Prometheus %(prometheusName)s is running full.

In short, "description" will be used as the body of your notification. Try to elaborate a little.

runbook_url

A link to a document that explains the alert in more detail and describes what action someone should take after receiving it. A runbook can live as a Markdown file in a GitHub repository. Every runbook should have at least the following sections:

Meaning (could be same as what you put into the "description")
Impact (explain the impact on the cluster, the infrastructure and/or the workload)
Diagnosis (what to do after you received the alert and how you can troubleshoot the problem)
Mitigation (possible remediation steps)

Example

- name: node-exporter
  rules:
  - alert: NodeFilesystemSpaceFillingUp
    annotations:
      description: Filesystem on {{ $labels.device }} at {{ $labels.instance }} has only {{ printf "%.2f" $value }}% available space left and is filling up.
      runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-nodefilesystemspacefillingup
      summary: Filesystem is predicted to run out of space within the next 24 hours.

Additional Resources

is cloned by

API-1367 Improve apiserver alerting

Closed

is related to

OCPPLAN-6068 Increase the overall quality for OpenShift's OOTB alerting rules

Closed

links to

openshift/cluster-kube-apiserver-operator#1267: reduce alert severity to warning where appropriate

There are no Sub-Tasks for this issue.
