Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: Observability
Labels:
- acmalertmanager
- triaged

Activity Type:
Quality / Stability / Reliability
Story Points:
1
Blocked:
False
Blocked Reason:

None
Ready:
False
Intelligence Requested:
Market:

Sprint:
MCO Core Sprint 46, MCO Core Sprint 47
Severity:
Low

Regression:
None

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

PX Impact Score:

Similar to ~~ACM-18001~~, several alerts take the rate of a given metric and expect it to exceed 10. These metrics usually increase only rarely (for example, once every 5 minutes per spoke), which means the alerts never fire unless the number of spokes is very high.
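
To make the scale problem concrete, here is a rough back-of-the-envelope sketch in Python. It assumes the alert compares a per-second rate, summed across all spokes, against a threshold of 10, and that each spoke increments the metric once every 5 minutes. The function names are hypothetical and are not taken from the operator code.

```python
import math


def aggregate_rate(spokes: int, seconds_between_increments: float) -> float:
    """Summed per-second rate when every spoke increments once per interval (assumed model)."""
    return spokes / seconds_between_increments


def min_spokes_to_fire(threshold_per_sec: float, seconds_between_increments: float) -> int:
    """Smallest spoke count whose summed rate is strictly above the threshold."""
    # n / interval > threshold  <=>  n > threshold * interval
    return math.floor(threshold_per_sec * seconds_between_increments) + 1


if __name__ == "__main__":
    print(aggregate_rate(100, 300))     # rate seen with 100 spokes
    print(min_spokes_to_fire(10, 300))  # spokes needed to cross a threshold of 10
```

Under these assumptions the threshold is only crossed with more than 3000 spokes, which is in line with the 3000+ managed clusters figure in ACM-18001.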

These alerts should be reviewed to decide whether we want to fix them or drop them altogether. Ideally the alerts would fire on error percentages (e.g. if 20% of all requests fail) instead of using a static rate threshold.
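
A minimal sketch of what a percentage-based condition could look like, again in Python. The 20% limit comes from the example above; the function name, its parameters, and the choice to stay silent when there is no traffic are assumptions for illustration, not the operator's actual rule.

```python
def should_fire(failed: float, total: float, max_error_ratio: float = 0.20) -> bool:
    """Return True when the share of failed requests reaches the limit."""
    if total <= 0:
        # No requests in the window: nothing to judge (assumed behavior).
        return False
    return failed / total >= max_error_ratio


if __name__ == "__main__":
    # A single spoke whose only request failed would already trigger the alert.
    print(should_fire(failed=1, total=1))
    # 1 failure out of 10 requests stays below the 20% limit.
    print(should_fire(failed=1, total=10))
```

Unlike a static rate, a ratio like this does not depend on how many spokes are reporting, so it behaves the same on small and large fleets.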

Alerts affected:

ACMMetricsCollectorFederationError
ACMMetricsCollectorForwardRemoteWriteError
ACMUWLMetricsCollectorFederationError
ACMUWLMetricsCollectorForwardRemoteWriteError
https://github.com/stolostron/multicluster-observability-operator/blob/2f4693b554510a12daa162da6018bcc7793568b4/operators/endpointmetrics/pkg/collector/metrics_collector.go#L447 (metrics collector ones on spokes)

is related to: ACM-18001 [2.13/main] ACMRemoteWriteError only fires on setups with 3000+ managed clusters (Closed)

Assignee: Daniel Buchanan
Reporter: Jacob Baungard Hansen
Votes: 0
Watchers: 2
Created: 2025/02/25 12:23 PM
Updated: 2025/09/10 12:04 PM
Resolved: 2025/09/10 12:04 PM
