OpenShift Monitoring / MON-927

Improve our alerting rules to clear up confusion about what they do, their impact, and the call to action

    • Alerting improvements
    • Done
    • OCPPLAN-6068 - Increase the overall quality for OpenShift's OOTB alerting rules
    • 0% To Do, 0% In Progress, 100% Done
    • Telco 5G Core

      Goals

      • Re-evaluate the severity of our "critical" alerting rules and adjust them if they are not critical for OpenShift.
      • Improve the description and summary of our "critical" alerting rules where necessary, so that it is clear what went wrong.
      • Add runbooks to our "critical" alerting rules that clear up 1) what the impact of a rule is and 2) the call to action when someone receives that alert.

      Non-Goals

      • Any improvements to non-"critical" alerts.

      Motivation

      We currently run around 170 alerting rules, and 50 of those are deemed "Critical". Unfortunately, most of these alerts are not clearly described and/or do not have runbooks, so customers do not know immediately what to do.

      Runbooks are an important component of alerting. Current alerts tend not to be very self-descriptive, and because any component, not just ours, can create alerts, it is hard for users to know which action to take to fix the problem that is causing an alert.

      There are already runbooks with actions in kubernetes-mixin, and we could use those so that others can contribute to them as well: https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/master/runbook.md#alert-name-nodefilesystemspacefillingup

      Alternatives

      Acceptance Criteria

      • Verify the following "Critical" alerts:
        • PrometheusBadConfig
        • PrometheusRemoteStorageFailures
        • PrometheusRemoteWriteBehind
        • PrometheusRuleFailures
        • PrometheusErrorSendingAlertsToAnyAlertmanager
        • ThanosSidecarPrometheusDown
        • ThanosSidecarUnhealthy
        • ThanosQueryHttpRequestQueryErrorRateHigh
        • ThanosQueryHttpRequestQueryRangeErrorRateHigh
        • ThanosQueryInstantLatencyHigh
        • ThanosQueryRangeLatencyHigh
        • AlertmanagerFailedReload
        • AlertmanagerMembersInconsistent
        • AlertmanagerClusterFailedToSendAlerts
        • AlertmanagerConfigInconsistent
        • AlertmanagerClusterDown
        • AlertmanagerClusterCrashlooping
        • KubeStateMetricsListErrors
        • KubeStateMetricsWatchErrors
      • Runbooks linked via runbook_url
      • Runbooks as markdown files in the https://github.com/openshift/runbooks repository
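
The criteria above could be checked mechanically. The following is a minimal sketch, not the team's actual tooling: it assumes the rules are available as Python dictionaries in the usual Prometheus rule-file layout (groups, rules, labels, annotations). The sample rules, the annotation texts, the `example.invalid` URL, and the helper name `missing_annotations` are all hypothetical and serve only as illustration.

```python
from typing import Dict, List

# Hypothetical sample data in the Prometheus rule-file layout.
# Expressions, annotation texts, and the URL are placeholders, not real values.
RULE_GROUPS = [
    {
        "name": "example",
        "rules": [
            {
                "alert": "PrometheusBadConfig",
                "expr": "vector(1)",  # placeholder expression
                "labels": {"severity": "critical"},
                "annotations": {
                    "summary": "Placeholder summary.",
                    "description": "Placeholder description.",
                    "runbook_url": "https://example.invalid/runbooks/PrometheusBadConfig.md",
                },
            },
            {
                "alert": "AlertmanagerFailedReload",
                "expr": "vector(1)",  # placeholder expression
                "labels": {"severity": "critical"},
                # description and runbook_url are deliberately missing
                "annotations": {"summary": "Placeholder summary."},
            },
            {
                "alert": "ExampleWarning",
                "expr": "vector(1)",  # placeholder expression
                "labels": {"severity": "warning"},
                "annotations": {},
            },
        ],
    }
]

REQUIRED_ANNOTATIONS = ("summary", "description", "runbook_url")


def missing_annotations(groups: List[dict]) -> Dict[str, List[str]]:
    """Map each critical alert to the required annotations it lacks."""
    problems: Dict[str, List[str]] = {}
    for group in groups:
        for rule in group.get("rules", []):
            if "alert" not in rule:
                continue  # recording rules carry no alert name
            if rule.get("labels", {}).get("severity") != "critical":
                continue  # non-critical alerts are out of scope here
            annotations = rule.get("annotations", {})
            missing = [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]
            if missing:
                problems[rule["alert"]] = missing
    return problems


if __name__ == "__main__":
    for alert, missing in missing_annotations(RULE_GROUPS).items():
        print(f"{alert}: missing {', '.join(missing)}")
```

Alerts with a severity other than "critical" are skipped on purpose, in line with the non-goals of this issue.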

      Risk and Assumptions

      Documentation Considerations

      • Probably only links to the runbooks.

      Open Questions

      Additional Notes

      Runbooks should be contributed to https://github.com/openshift/runbooks.

              rhn-coreos-brison Brad Ison (Inactive)
              lcosic@redhat.com Ljiljana Cosic (Inactive)
              Hongyan Li Hongyan Li