OpenShift Monitoring / MON-2358

Add runbook for PrometheusOperatorRejectedResources alert


    • Type: Task
    • Resolution: Done
    • Priority: Major
    • Component: prometheus-operator
    • Sprint: MON Sprint 246

      In OCP 4.10, the PrometheusOperatorRejectedResources alert (along with the other Prometheus operator alerts) has been extended to cover the openshift-user-workload-monitoring namespace.

      The CCX team has seen that about 5% of 4.10 clusters have the alert firing for openshift-user-workload-monitoring. In practice this means that some user-defined pod/service monitors are invalid, for example because of invalid scrape interval values or references to missing secrets used for scrape authentication.
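
      The alert is based on the operator's prometheus_operator_managed_resources metric, which counts accepted and rejected resources per state. A quick way to see where rejections happen is to query it through the Prometheus HTTP API; the sketch below assumes a reachable querier URL and a bearer token with query permissions (both are placeholders, not part of this ticket).

          import requests

          # Placeholders: a reachable Thanos querier / Prometheus URL and a token
          # that is allowed to run queries.
          PROM_URL = "https://thanos-querier.example.com"
          TOKEN = "sha256~REDACTED"

          # prometheus_operator_managed_resources reports, per resource type and
          # state, how many objects the operator accepted or rejected.
          query = 'sum by (namespace, resource) (prometheus_operator_managed_resources{state="rejected"}) > 0'

          resp = requests.get(
              f"{PROM_URL}/api/v1/query",
              params={"query": query},
              headers={"Authorization": f"Bearer {TOKEN}"},
          )
          resp.raise_for_status()

          # Print which namespace/resource combinations currently have rejections.
          for item in resp.json()["data"]["result"]:
              labels = item["metric"]
              print(f'{labels.get("namespace")}: {labels.get("resource")} rejected = {item["value"][1]}')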

      Eventually we want the Prometheus operator to be more user-friendly and provide direct feedback to the users:

      1. Do more with OpenAPI spec validations.
      2. Implement/configure validating webhooks for things that can't be modeled directly with OpenAPI (see the sketch after this list).
      3. Implement the status subresource for service/pod monitors.
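
      As an illustration of item 2, a validating webhook could reject a pod/service monitor whose scrape interval is not a valid Prometheus duration, something OpenAPI format validation alone can't fully express. The following is only a simplified sketch of such a check, not the operator's actual code; the duration regex is an approximation of Prometheus' syntax.

          import re

          # Approximation of the Prometheus duration syntax ("30s", "1m30s", "2h", ...).
          # The real parser in Prometheus / prometheus-operator is stricter.
          DURATION_RE = re.compile(r"^([0-9]+(ms|s|m|h|d|w|y))+$")

          def validate_endpoint(endpoint: dict) -> list:
              """Return human-readable errors for one pod/service monitor endpoint."""
              errors = []
              for field in ("interval", "scrapeTimeout"):
                  value = endpoint.get(field)
                  if value is not None and not DURATION_RE.match(value):
                      errors.append(f"invalid {field}: {value!r}")
              return errors

          # The kind of user-defined endpoint that gets rejected today, with only
          # the generic PrometheusOperatorRejectedResources alert as feedback.
          print(validate_endpoint({"port": "web", "interval": "30 seconds"}))
          # -> ["invalid interval: '30 seconds'"]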

      But in the meantime, the alert's description should be improved to include more details about the cause and how to mitigate the issue. Similarly, we need to add a runbook in github.com/openshift/runbooks and link it from the CMO alert.

      [1] https://github.com/openshift/cluster-monitoring-operator/pull/1370

       

      DoD

      • Improved the description & summary annotations of the upstream alerts.
      • Dedicated runbook in openshift/runbooks.
      • Everything pulled together in CMO.
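
      Once the runbook and CMO changes land, a rough way to verify the last two items is to check that the rendered alerting rule carries the improved description and a runbook_url annotation pointing at openshift/runbooks. A minimal sketch against the Prometheus rules API (URL and token are again placeholders):

          import requests

          PROM_URL = "https://thanos-querier.example.com"  # placeholder
          TOKEN = "sha256~REDACTED"                         # placeholder

          resp = requests.get(
              f"{PROM_URL}/api/v1/rules",
              params={"type": "alert"},
              headers={"Authorization": f"Bearer {TOKEN}"},
          )
          resp.raise_for_status()

          # Look up the alerting rule and print its annotations.
          for group in resp.json()["data"]["groups"]:
              for rule in group["rules"]:
                  if rule.get("name") == "PrometheusOperatorRejectedResources":
                      annotations = rule.get("annotations", {})
                      print("description:", annotations.get("description"))
                      print("runbook_url:", annotations.get("runbook_url"))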

            Haoyu Sun (hasun@redhat.com)
            Simon Pasquier (spasquie@redhat.com)
            Junqi Zhao