- Epic
- Resolution: Done
- Major
- None
- None
- Alerting improvements
- Done
- OCPPLAN-6068 - Increase the overall quality for OpenShift's OOTB alerting rules
- 0% To Do, 0% In Progress, 100% Done
- Telco 5G Core
Goals
- Re-evaluate the severity of our "critical" alerting rules and adjust them if they are not critical for OpenShift.
- Improve description and summary for our "critical" alerting rules if necessary to make it clear what went wrong.
- Add runbooks to our "critical" alerting rules that clear up 1) what the impact of a rule is and 2) the call to action when someone receives that alert.
Non-Goals
- Any improvements to non-"critical" alerts.
Motivation
We currently run around 170 alerting rules, and 50 of those are deemed "Critical". Unfortunately, most of these alerts are not clearly described and/or do not have runbooks that would let customers know immediately what to do.
Runbooks are an important part of alerting. Current alerts tend not to be self-descriptive, and because any component, not just ours, can create alerts, it will be hard for users to know which action to take to fix the problem causing an alert.
There are already runbooks with actions in kubernetes-mixin, and we could use those so that others can contribute to them as well: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/runbook.md#alert-name-nodefilesystemspacefillingup
Alternatives
Acceptance Criteria
- Verify the following "Critical" alerts:
  - PrometheusBadConfig
  - PrometheusRemoteStorageFailures
  - PrometheusRemoteWriteBehind
  - PrometheusRuleFailures
  - PrometheusErrorSendingAlertsToAnyAlertmanager
  - ThanosSidecarPrometheusDown
  - ThanosSidecarUnhealthy
  - ThanosQueryHttpRequestQueryErrorRateHigh
  - ThanosQueryHttpRequestQueryRangeErrorRateHigh
  - ThanosQueryInstantLatencyHigh
  - ThanosQueryRangeLatencyHigh
  - AlertmanagerFailedReload
  - AlertmanagerMembersInconsistent
  - AlertmanagerClusterFailedToSendAlerts
  - AlertmanagerConfigInconsistent
  - AlertmanagerClusterDown
  - AlertmanagerClusterCrashlooping
  - KubeStateMetricsListErrors
  - KubeStateMetricsWatchErrors
- Runbooks linked via runbook_url
- Runbooks as markdown files in the https://github.com/openshift/runbooks repository
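One way to check the criteria above mechanically is a small lint pass over the parsed alerting rules. The sketch below is only an illustration, not an existing tool: it assumes the rules are already loaded into Python dictionaries shaped like a Prometheus rule file (`groups` → `rules` → `alert`, `labels`, `annotations`). The function name `missing_annotations`, the sample rules, the placeholder expressions, and the runbook path layout under openshift/runbooks are all hypothetical.

```python
from typing import Dict, List

# Annotations every "critical" alert should carry, per the goals of this epic.
REQUIRED_ANNOTATIONS = ("summary", "description", "runbook_url")

# Hypothetical sample data shaped like a parsed Prometheus rule file.
# Expressions are placeholders and the runbook path layout is an assumption.
SAMPLE_RULES = {
    "groups": [
        {
            "name": "example",
            "rules": [
                {
                    "alert": "PrometheusBadConfig",
                    "expr": "vector(1)",  # placeholder, not the real expression
                    "labels": {"severity": "critical"},
                    "annotations": {
                        "summary": "Failed Prometheus configuration reload.",
                        "description": "Prometheus has failed to reload its configuration.",
                        "runbook_url": "https://github.com/openshift/runbooks/blob/master/alerts/PrometheusBadConfig.md",
                    },
                },
                {
                    "alert": "AlertmanagerFailedReload",
                    "expr": "vector(1)",  # placeholder
                    "labels": {"severity": "critical"},
                    "annotations": {
                        "summary": "Reloading an Alertmanager configuration has failed.",
                        "description": "Configuration has failed to load.",
                        # runbook_url intentionally missing
                    },
                },
                {
                    # Non-critical alerts are out of scope (see Non-Goals).
                    "alert": "SomeWarningAlert",
                    "expr": "vector(1)",  # placeholder
                    "labels": {"severity": "warning"},
                    "annotations": {},
                },
            ],
        }
    ]
}


def missing_annotations(rule_groups: Dict) -> Dict[str, List[str]]:
    """Map each critical alert name to the required annotations it lacks."""
    problems: Dict[str, List[str]] = {}
    for group in rule_groups.get("groups", []):
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules have no "alert" key
            if rule.get("labels", {}).get("severity") != "critical":
                continue  # only "critical" alerts are in scope
            annotations = rule.get("annotations", {})
            missing = [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]
            if missing:
                problems[rule["alert"]] = missing
    return problems


if __name__ == "__main__":
    for alert, missing in missing_annotations(SAMPLE_RULES).items():
        print(f"{alert}: missing {', '.join(missing)}")
```

A check like this could run in CI against the rendered rule files, so that a new "critical" alert without a summary, description, or runbook link is caught before it ships.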
Risk and Assumptions
Documentation Considerations
- Probably only links to the runbooks.
Open Questions
Additional Notes
Runbooks should be contributed to https://github.com/openshift/runbooks.
- is related to: OCPPLAN-7730 Runbooks and Alerts for BM (New)