OpenShift Monitoring / MON-3397

[CEE.neXT] User-workload monitoring "admins" should be able to write general purpose alerting rules that can span several namespaces.


    • Type: Epic
    • Resolution: Unresolved
    • Priority: Normal
    • Epic Name: UWM cross namespace alerts
    • Status: To Do
    • Progress: 40% To Do, 40% In Progress, 20% Done

      Epic Goal

      • Allow user-defined monitoring administrators to define PrometheusRule objects whose rules span multiple or all user namespaces.

      Why is this important?

      • There's often a need to define similar alerting rules for multiple user namespaces (typically when the rule works on platform metrics such as kube-state-metrics or kubelet metrics).
      • Today, such a rule has to be duplicated in each user namespace, which doesn't scale well:
        • 100 expressions selecting 1 namespace each are more expensive than 1 expression selecting 100 namespaces.
        • updating 100 PrometheusRule resources is more time-consuming and error-prone than updating 1 PrometheusRule object.
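The difference between the two approaches can be sketched in Python as follows. This is only an illustration: the namespace names, the alert name, the threshold, and the choice of `kube_pod_container_status_restarts_total` as the kube-state-metrics series are assumptions made for the example and do not come from the epic.

```python
# Hypothetical user namespaces; a real cluster would have its own list.
NAMESPACES = [f"team-{i}" for i in range(1, 4)]

# Illustrative kube-state-metrics series and threshold (assumptions).
METRIC = "kube_pod_container_status_restarts_total"


def per_namespace_exprs(namespaces):
    """Current situation: one expression per namespace, one PrometheusRule each."""
    return {ns: f'increase({METRIC}{{namespace="{ns}"}}[1h]) > 5' for ns in namespaces}


def cross_namespace_expr(namespaces):
    """Goal of the epic: a single expression selecting all the namespaces."""
    # Namespace names are DNS labels, so they need no regex escaping here.
    pattern = "|".join(namespaces)
    return f'increase({METRIC}{{namespace=~"{pattern}"}}[1h]) > 5'


def prometheus_rule(name, namespace, expr):
    """Build a minimal PrometheusRule manifest as a plain dict."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "groups": [
                {
                    "name": "restarts",
                    "rules": [{"alert": "PodRestartingOften", "expr": expr}],
                }
            ]
        },
    }


# Current situation: N objects to create and keep in sync.
duplicated = [
    prometheus_rule("restarts", ns, expr)
    for ns, expr in per_namespace_exprs(NAMESPACES).items()
]

# With cross-namespace rules: one object in a (hypothetical) admin namespace.
single = prometheus_rule("restarts", "uwm-admin-rules", cross_namespace_expr(NAMESPACES))
```

With 100 namespaces, `duplicated` would hold 100 manifests to update on every change, while `single` stays one object with one expression to evaluate.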

      Scenarios

      1. A user-defined monitoring admin can provision a PrometheusRule object whose PromQL expressions aren't scoped to the namespace where the object is defined.
      2. A cluster admin can forbid user-defined monitoring admins from using cross-namespace rules.

      Acceptance Criteria

      • CI - MUST be running successfully with tests automated
      • Release Technical Enablement - Provide necessary release enablement details and documents.
      • Follow FeatureGate Guidelines
      • ...

      Dependencies (internal and external)

      1. None (Prometheus-operator already supports defining namespace-enforcement exceptions for PrometheusRule objects).
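As a rough sketch of what such an exception could look like, the Python below builds an entry for the `excludedFromEnforcement` list that, to the best of my understanding of the upstream prometheus-operator API, exists on the Prometheus and ThanosRuler specs. The epic does not say how the exception would be wired into user-workload monitoring, so the helper names and the `uwm-admin-rules` namespace are hypothetical.

```python
def enforcement_exception(namespace, name=None):
    """Build an object reference exempting PrometheusRule objects from
    namespace-label enforcement (field names as understood from the
    upstream prometheus-operator API; treat them as an assumption)."""
    ref = {
        "group": "monitoring.coreos.com",
        "resource": "prometheusrules",
        "namespace": namespace,
    }
    # Assumption: when no name is given, the reference covers every
    # PrometheusRule object in the namespace.
    if name is not None:
        ref["name"] = name
    return ref


def with_exceptions(spec, refs):
    """Return a copy of a Prometheus/ThanosRuler spec dict with the given
    references appended to its excludedFromEnforcement list."""
    updated = dict(spec)
    existing = list(spec.get("excludedFromEnforcement", []))
    updated["excludedFromEnforcement"] = existing + list(refs)
    return updated


# Hypothetical usage: exempt every rule object in an admin-owned namespace.
spec = {"enforcedNamespaceLabel": "namespace"}
new_spec = with_exceptions(spec, [enforcement_exception("uwm-admin-rules")])
```

The helper copies the spec instead of mutating it, so the original configuration stays available for comparison or rollback.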

      Previous Work (Optional):

      1.  

      Open questions:

      In terms of risks:

      • UWM admins may configure rules which overload the platform Prometheus and Thanos Querier.
        • This is not very different from the current situation where ThanosRuler can run many UWM rules.
        • All requests first go through the Thanos Querier, which should "protect" Prometheus from DoS queries (there's a hard limit of 4 in-flight queries per Thanos Querier pod).
      • UWM admins may configure rules that access platform metrics unavailable to application owners (e.g. metrics without a namespace label, or with a namespace label matching openshift-*).
        • In practice, UWM admins already have access to these metrics so it isn't a big change.
        • It also enables use cases such as ROSA customers with admin rights, who can't deploy their platform alerts to openshift-monitoring today. This feature lifts that limitation.

      Done Checklist

      • CI - CI is running, tests are automated and merged.
      • Release Enablement <link to Feature Enablement Presentation>
      • DEV - Upstream code and tests merged: <link to meaningful PR or GitHub Issue>
      • DEV - Upstream documentation merged: <link to meaningful PR or GitHub Issue>
      • DEV - Downstream build attached to advisory: <link to errata>
      • QE - Test plans in Polarion: <link or reference to Polarion>
      • QE - Automated tests merged: <link or reference to automated tests>
      • DOC - Downstream documentation merged: <link to meaningful PR>

            Simon Pasquier (spasquie@redhat.com)
            Apurva Nisal (rhn-support-anisal)
            Tai Gao