OpenShift Monitoring / MON-2358

Add runbook for PrometheusOperatorRejectedResources alert

    • Type: Task
    • Resolution: Done
    • Priority: Major
    • prometheus-operator
    • MON Sprint 246

      In OCP 4.10, the PrometheusOperatorRejectedResources alert (along with other alerts related to Prometheus operator) has been extended to cover the openshift-user-workload-monitoring namespace.

      The CCX team has seen that about 5% of 4.10 clusters have the alert firing for openshift-user-workload-monitoring. In practice, this means that some of the user-defined pod/service monitors are invalid (for example, they contain invalid scrape interval values or reference missing secrets for scrape authentication).
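
      To illustrate the first kind of invalid monitor mentioned above, here is a minimal Python sketch that checks whether a scrape interval string looks like a Prometheus-style duration. The function name and the exact pattern are assumptions made for illustration and are not taken from the Prometheus operator code. The sketch rejects an empty string; whether an unset interval is acceptable is outside its scope.

```python
import re

# Assumption: a pattern modeled on Prometheus duration syntax, with units in
# descending order (y, w, d, h, m, s, ms), e.g. "30s" or "1m30s".
_DURATION_RE = re.compile(
    r"^(0|(([0-9]+)y)?(([0-9]+)w)?(([0-9]+)d)?(([0-9]+)h)?"
    r"(([0-9]+)m)?(([0-9]+)s)?(([0-9]+)ms)?)$"
)


def is_valid_scrape_interval(value: str) -> bool:
    """Return True if value looks like a Prometheus-style duration.

    Hypothetical helper for illustration only.
    """
    # The pattern alone would accept an empty string, so reject it explicitly.
    return bool(value) and _DURATION_RE.match(value) is not None
```

      With this sketch, values such as "30s" or "1m30s" pass, while "30 seconds" or "5x" are flagged before the monitor is ever submitted.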

      Eventually we want the Prometheus operator to be more user-friendly and provide direct feedback to the users:

      1. Rely more on OpenAPI schema validation in the CRDs.
      2. Implement/configure validating webhooks for things that can't be modeled directly with OpenAPI.
      3. Implement the status subresource for service/pod monitors.
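
      A reference to a missing secret is an example of a check that a schema alone cannot express, because it depends on other objects in the namespace. The following Python sketch shows the kind of cross-resource check that a validating webhook could perform. The SecretRef type, the function name, and the shape of the "existing" mapping are all hypothetical and do not reflect the operator's actual API.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass(frozen=True)
class SecretRef:
    """Hypothetical reference to one key of a Secret in the monitor's namespace."""
    name: str
    key: str


def missing_secret_refs(
    refs: List[SecretRef], existing: Dict[str, Set[str]]
) -> List[SecretRef]:
    """Return the references that cannot be resolved.

    `existing` maps a secret name to the set of keys it contains
    (an assumed, simplified view of the namespace).
    """
    # A reference is missing if the secret is absent or lacks the key.
    return [r for r in refs if r.key not in existing.get(r.name, set())]
```

      A non-empty result is the sort of detail that could be reported back to the user directly instead of only surfacing through the alert.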

      But in the meantime, the alert description should be improved to include more details about the cause and how to mitigate the issue. Likewise, we need to add a runbook in github.com/openshift/runbooks and link to it from the CMO alert.

      [1] https://github.com/openshift/cluster-monitoring-operator/pull/1370

       

      DoD

      • Improved the description & summary annotations of the upstream alerts.
      • Dedicated runbook in openshift/runbooks.
      • Everything pulled together in CMO.

              Haoyu Sun (hasun@redhat.com)
              Simon Pasquier (spasquie@redhat.com)
              Junqi Zhao