Red Hat Advanced Cluster Management: ACM-17999

[2.11] ACMRemoteWriteError only fires on setups with 3000+ managed clusters


    • Bug
    • Resolution: Unresolved
    • Normal
    • ACM 2.11.7
    • ACM 2.11.Z
    • Observability
    • None
    • False
    • None
    • False
    • Low
    • None

      Description of problem:

      The alert "ACMRemoteWriteError" is supposed to fire whenever there are problems writing to external remote write endpoints such as VictoriaMetrics. This functionality works by Observatorium-API forwarding requests from metrics-collector to both the Hub thanos-receive and any external endpoints.

      This means that for each managed cluster, we expect one remote write request every 5 minutes.

      The alert, however, is defined as shown below. It computes the per-second rate, averaged over the last 5 minutes, of responses with a non-2xx status code, and fires only when that rate exceeds 10. In other words, the alert requires more than 10 errors per second. Since we expect only one remote write request per managed cluster every 5 minutes, each cluster contributes at most 1/300 errors per second, so the alert cannot reach the threshold unless the number of managed clusters exceeds 10*60*5 = 3000 at the default 5 minute interval.

      sum by (code)(rate(acm_remote_write_requests_total{code!~"2.*"}[5m])) > 10

      Fix the above, for example by using `increase` instead, and ensure that we alert when, e.g., a significant number of managed clusters have failed (maybe 10%?).
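
      As a rough illustration of the suggestion above, and not a tested or agreed-upon rule, the expression could compare failed requests to all requests instead of using an absolute per-second rate. The sketch assumes that `acm_remote_write_requests_total` also counts successful (2xx) requests, so that the failed fraction approximates the fraction of managed clusters failing to write. The 10m window and the 0.1 threshold are placeholder assumptions.

      ```promql
      # Sketch only: fraction of remote write requests that returned a non-2xx code.
      # Assumes the same counter also records 2xx responses (not confirmed in this issue).
      # The 10m window and the 0.1 (10%) threshold are illustrative placeholders.
      (
        sum(increase(acm_remote_write_requests_total{code!~"2.*"}[10m]))
      /
        sum(increase(acm_remote_write_requests_total[10m]))
      ) > 0.1
      ```

      Because this is a ratio, it does not depend on the number of managed clusters. With 100 clusters and 20 of them failing, the expression evaluates to roughly 0.2 and would fire, whereas the current rate-based expression stays far below 10.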

      Version-Release number of selected component (if applicable):

      2.10+

      How reproducible:

      always

      Steps to Reproduce:

      1. Set up an external remote write endpoint
      2. Take down the external remote write endpoint
      3. Check that the alert never fires

      Actual results:

      The alert never fires

      Expected results:

      The alert fires

      Additional info:

              mzardab@redhat.com Moad Zardab
              rh-ee-jachanse Jacob Baungard Hansen