-
Bug
-
Resolution: Unresolved
-
Normal
-
ACM 2.11.Z
-
None
-
False
-
None
-
False
-
-
-
Low
-
None
Description of problem:
The alert "ACMRemoteWriteError" is supposed to fire whenever there are problems writing to external remote write endpoints such as VictoriaMetrics. This works by Observatorium-API forwarding requests from metrics-collector to both the Hub thanos-receive and any external endpoints.
This means that for each managed cluster, we expect one remote write request every 5 minutes.
The alert, however, is defined as shown below. It calculates the average per-second rate, over the last 5 minutes, of responses with non-2xx codes, summed per response code, and fires only when that rate exceeds 10. In other words, the alert fires only when there are more than 10 errors per second. Since we expect only one remote write request per managed cluster every 5 minutes (300 seconds), the alert can never reach the threshold unless the number of managed clusters exceeds 10*60*5 = 3000 at the default 5 minute scrape interval.
sum by (code)(rate(acm_remote_write_requests_total{code!~"2.*"}[5m])) > 10
Fix the above by, for example, using `increase` instead, and ensure that we alert if, e.g., a significant number of managed clusters fail to write (maybe 10%?).
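One possible shape for such an expression is sketched below. It is only an illustration, not a tested rule: it reuses the existing `acm_remote_write_requests_total` metric and its `code` label, and it uses the share of failed requests as a proxy for the share of failing managed clusters, which holds only if every cluster sends requests at the same interval. The 15 minute window and the 0.1 threshold are assumptions to be tuned.

```promql
# Sketch: fire when more than 10% of remote write requests
# in the last 15 minutes returned a non-2xx code.
# The 15m window (about 3 requests per cluster at the default
# 5 minute interval) and the 0.1 ratio are assumed values.
(
  sum(increase(acm_remote_write_requests_total{code!~"2.*"}[15m]))
/
  sum(increase(acm_remote_write_requests_total[15m]))
) > 0.1
```

Because this is a ratio rather than an absolute rate, the threshold does not depend on the number of managed clusters, so it would behave the same on a hub with 5 clusters as on one with 3000.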
Version-Release number of selected component (if applicable):
2.10+
How reproducible:
Always
Steps to Reproduce:
- Set up an external write endpoint
- Take down the external write endpoint
- Observe that the alert never fires
Actual results:
The alert never fires
Expected results:
The alert fires