Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: ACM 2.11.7
Affects Version/s: ACM 2.11.Z
Component/s: Observability
Labels: None

Blocked: False
Blocked Reason: None
Ready: False
Intelligence Requested:
Market:

Severity: Low

Regression: None

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:
Description of problem:

The alert "ACMRemoteWriteError" is supposed to fire whenever there are problems writing to external remote write endpoints such as VictoriaMetrics. This functionality works by Observatorium-API forwarding requests from metrics-collector to both the Hub thanos-receive and any external endpoints.

This means that for each managed cluster, we expect one remote write request every 5 minutes.

The alert, however, is defined as shown below. It calculates the average per-second rate, over the last 5 minutes, of responses with non-2xx status codes, and fires only when that rate exceeds 10. In other words, the alert fires only when there are more than 10 errors per second. Given that we expect only one remote write request per managed cluster every 5 minutes, the alert never reaches the threshold unless the number of managed clusters exceeds 10*60*5 = 3000 at the default 5 minute scrape interval.

sum by (code)(rate(acm_remote_write_requests_total{code!~"2.*"}[5m])) > 10
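The arithmetic behind the 3000-cluster figure can be checked with a small sketch. It assumes, as described above, that each managed cluster sends exactly one remote write request per interval and that every one of those requests fails. The function names are illustrative only and are not part of ACM.

```python
def clusters_needed_to_fire(threshold_per_second: float, interval_seconds: int) -> float:
    """Cluster count at which a rate()-based error rate equals the threshold.

    Assumes one failing request per cluster per interval, so the error rate
    is clusters / interval_seconds errors per second.
    """
    return threshold_per_second * interval_seconds


def error_rate_per_second(failing_clusters: int, interval_seconds: int) -> float:
    """Average per-second error rate when each failing cluster sends one request per interval."""
    return failing_clusters / interval_seconds


if __name__ == "__main__":
    interval = 5 * 60  # default 5 minute interval, in seconds
    # The rule uses "> 10", so the alert needs more clusters than this value.
    print(clusters_needed_to_fire(10, interval))  # 3000
    # A fleet of 100 failing clusters stays far below the threshold of 10.
    print(error_rate_per_second(100, interval) < 10)  # True
```

Under these assumptions, even a complete outage of the external endpoint on a 100-cluster fleet yields roughly 0.33 errors per second, which is why the alert stays silent.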

Fix the above by, for example, using `increase` instead, and ensure that we alert if, e.g., a significant number of managed clusters failed (maybe 10%?).
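One possible reading of this suggestion is a ratio check: compare the number of failed requests in the window against the total number of requests in the same window. The sketch below is a minimal illustration of that condition, not the actual rule. `should_alert` is a hypothetical helper, its inputs stand in for the per-window counts that `increase` would report, and the 10% default simply mirrors the tentative figure above.

```python
def should_alert(failed_requests: float, total_requests: float,
                 max_failure_ratio: float = 0.10) -> bool:
    """Return True when the share of failed remote write requests exceeds the ratio.

    The 10% default is an assumption taken from the tentative figure in the
    report, not a decided value.
    """
    if total_requests <= 0:
        return False  # no traffic in the window, nothing to judge
    return failed_requests / total_requests > max_failure_ratio


if __name__ == "__main__":
    # 50 managed clusters, one request each in the window, 10 of them failing.
    print(should_alert(failed_requests=10, total_requests=50))  # True
    # The same fleet with 2 failures stays below the 10% threshold.
    print(should_alert(failed_requests=2, total_requests=50))   # False
```

Unlike the per-second rate threshold, a ratio of this kind does not depend on the absolute number of managed clusters.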

Version-Release number of selected component (if applicable):

2.10+

How reproducible:

always

Steps to Reproduce:

Set up an external write endpoint
Take down the external write endpoint
Check that the alert never fires

Actual results:

The alert never fires

Expected results:

The alert fires

Additional info:

clones

ACM-17998 [2.10] ACMRemoteWriteError only fires on setups with 3000+ managed clusters

is cloned by

ACM-18000 [2.12] ACMRemoteWriteError only fires on setups with 3000+ managed clusters

Assignee: Moad Zardab

Reporter: Jacob Baungard Hansen

Votes: 0

Watchers: 1

Created: 2025/02/19 8:35 AM

Updated: 2025/02/19 8:39 AM
