Type: Epic
Resolution: Done
Priority: Major
Fix Version/s: openshift-4.9
Affects Version/s: None
Component/s: None
Labels:
- doc-ack
- feature
- groomed
- pm-request
- px-ack
- qe-ack

Epic Name:
Alerting improvements
Epic Status:
Done
Feature Link:
OCPPLAN-6068 - Increase the overall quality for OpenShift's OOTB alerting rules
Parent Link:
OCPPLAN-6068 - Increase the overall quality for OpenShift's OOTB alerting rules
Hierarchy Progress Bar:

0% To Do, 0% In Progress, 100% Done
Product Sponsor:
Telco 5G Core

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Goals

Re-evaluate the severity of our "critical" alerting rules and adjust them if they are not critical for OpenShift.
Improve description and summary for our "critical" alerting rules if necessary to make it clear what went wrong.
Add runbooks to our "critical" alerting rules that clear up 1) what the impact of a rule is and 2) the call to action when someone receives that alert.

Non-Goals

Any improvements to non-"critical" alerts.

Motivation

We currently run around 170 alerting rules, and about 50 of those are classified as "critical". Unfortunately, most of these alerts are not clearly described and/or do not have runbooks, so customers do not know immediately what to do.

Runbooks are an important component of alerting. Current alerts tend not to be self-descriptive, and because any component, not just ours, can create alerts, it is hard for users to know which action to take to fix the problem that is causing an alert.

There are already runbooks with actions in kubernetes-mixin, and we could build on those so that others can contribute to them as well. https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/runbook.md#alert-name-nodefilesystemspacefillingup

Alternatives

Acceptance Criteria

Verify the following "Critical" alerts:
- PrometheusBadConfig
- PrometheusRemoteStorageFailures
- PrometheusRemoteWriteBehind
- PrometheusRuleFailures
- PrometheusErrorSendingAlertsToAnyAlertmanager
- ThanosSidecarPrometheusDown
- ThanosSidecarUnhealthy
- ThanosQueryHttpRequestQueryErrorRateHigh
- ThanosQueryHttpRequestQueryRangeErrorRateHigh
- ThanosQueryInstantLatencyHigh
- ThanosQueryRangeLatencyHigh
- AlertmanagerFailedReload
- AlertmanagerMembersInconsistent
- AlertmanagerClusterFailedToSendAlerts
- AlertmanagerConfigInconsistent
- AlertmanagerClusterDown
- AlertmanagerClusterCrashlooping
- KubeStateMetricsListErrors
- KubeStateMetricsWatchErrors
Runbooks linked via runbook_url
Runbooks as markdown files in the https://github.com/openshift/runbooks repository
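
The criteria above could be checked mechanically. The following is a minimal sketch, not the team's actual tooling: it assumes rules are already loaded as dictionaries shaped like Prometheus alerting rules (with "alert", "labels", and "annotations" keys). The function name audit_critical_rules, the sample rule contents, the ExampleWarningAlert rule, and the runbook file path are all hypothetical and used only for illustration.

```python
# Sketch: flag "critical" alerting rules that lack a summary, description,
# or runbook_url, or whose runbook is hosted outside openshift/runbooks.
REQUIRED_ANNOTATIONS = ("summary", "description", "runbook_url")
RUNBOOK_PREFIX = "https://github.com/openshift/runbooks"


def audit_critical_rules(rules):
    """Return {alert name: [problems]} for critical alerts with gaps."""
    findings = {}
    for rule in rules:
        if "alert" not in rule:
            continue  # recording rules have no "alert" key
        if rule.get("labels", {}).get("severity") != "critical":
            continue  # non-critical alerts are out of scope for this epic
        annotations = rule.get("annotations", {})
        problems = [
            f"missing {key}"
            for key in REQUIRED_ANNOTATIONS
            if not annotations.get(key)
        ]
        url = annotations.get("runbook_url", "")
        if url and not url.startswith(RUNBOOK_PREFIX):
            problems.append("runbook_url outside openshift/runbooks")
        if problems:
            findings[rule["alert"]] = problems
    return findings


# Illustrative input only; the annotation texts and the runbook path
# below are made up and do not reflect the shipped rules.
SAMPLE_RULES = [
    {
        "alert": "PrometheusBadConfig",
        "labels": {"severity": "critical"},
        "annotations": {
            "summary": "Prometheus failed to reload its configuration.",
            "description": "Example description of impact and cause.",
            "runbook_url": RUNBOOK_PREFIX
            + "/blob/master/alerts/example/PrometheusBadConfig.md",
        },
    },
    {
        "alert": "ThanosSidecarUnhealthy",
        "labels": {"severity": "critical"},
        "annotations": {"summary": "Thanos sidecar is unhealthy."},
    },
    {
        "alert": "ExampleWarningAlert",
        "labels": {"severity": "warning"},
        "annotations": {},
    },
    {"record": "example:recording_rule", "expr": "vector(1)"},
]

if __name__ == "__main__":
    for name, problems in audit_critical_rules(SAMPLE_RULES).items():
        print(f"{name}: {', '.join(problems)}")
```

A check along these lines could run in CI against the rule files, so that a critical alert added later without a runbook link is caught before it ships.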

Risk and Assumptions

Documentation Considerations

Probably only links to the runbooks.

Open Questions

Additional Notes

Runbooks should be contributed to https://github.com/openshift/runbooks.

is related to

OCPPLAN-7730 Runbooks and Alerts for BM

links to

OpenShift enhancement proposal on alerting consistency

Assignee: Brad Ison (Inactive)

Reporter: Ljiljana Cosic (Inactive)

QA Contact: Hongyan Li

Votes: 0

Watchers: 16

Created: 2020/01/14 7:26 AM

Updated: 2022/08/26 2:27 PM

Resolved: 2021/09/08 3:02 PM
