OpenShift Request For Enhancement / RFE-1620

Alertmanager rule for Priority And Fairness feature


      1. Proposed title of this feature request

      Alertmanager rule for Priority And Fairness feature

      2. What is the nature and description of the request?

      Since OpenShift Container Platform 4.5, the "Priority and Fairness" (P&F) feature has been enabled by default. In some cases this feature has also throttled internal communication, leading to API outages. This RFE is not about those cases (they are bugs, see BZ#1912566, BZ#1825219 etc.), but about surfacing the fact that the P&F feature is actively throttling requests, which may indicate an overloaded API or a malfunctioning workload.

      The following query is one example of a helpful way to track the requests currently executing per flow schema and priority level:

      ~~~
      sum(apiserver_flowcontrol_current_executing_requests) by (flowSchema,priorityLevel)
      ~~~

      The query above is only an example and not necessarily what this feature needs to include. A new metric for the P&F feature may need to be created to give this alert a more sensible basis. Other queries may report similar information or be a better fit; Engineering has to check this.
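
      One candidate that Engineering could evaluate is the counter of requests that P&F rejects outright. The rule below is a minimal sketch only, assuming that `apiserver_flowcontrol_rejected_requests_total` is exposed by the cluster and carries the same `flowSchema` and `priorityLevel` labels as the query above (label names may differ between versions). The group name, the alert name, the 5m rate window, the 10m duration and the severity are hypothetical placeholders, not proposed values:

      ~~~
      groups:
        - name: apiserver-priority-and-fairness   # hypothetical group name
          rules:
            - alert: APIServerRequestsRejectedByPAndF   # hypothetical alert name
              # Fires while any flow schema / priority level pair keeps rejecting requests.
              expr: sum(rate(apiserver_flowcontrol_rejected_requests_total[5m])) by (flowSchema,priorityLevel) > 0
              for: 10m   # placeholder duration, needs tuning
              labels:
                severity: warning   # placeholder severity
              annotations:
                summary: "P&F is rejecting requests for flow schema {{ $labels.flowSchema }} at priority level {{ $labels.priorityLevel }}."
      ~~~

      A rule of this kind would only cover rejected requests; requests that are queued but eventually served would need a separate signal.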

      3. Why does the customer need this? (List the business requirements here)

      The customer noted that such an alert would have prevented longer API outages and would have surfaced throttling issues in the cluster much sooner. Such an Alertmanager rule would increase platform stability.

      4. List any affected packages or components.

      Alertmanager

            wcabanba@redhat.com William Caban
            rhn-support-skrenger Simon Krenger