-
Bug
-
Resolution: Done
-
Normal
-
None
-
None
-
Quality / Stability / Reliability
-
1
-
False
-
-
False
-
-
-
MCO Core Sprint 46, MCO Core Sprint 47
-
Low
-
None
Similar to ACM-18001, several alerts take the rate of a given metric and expect it to exceed 10. These metrics usually increase only rarely (for example, once every 5 minutes per spoke), which means the alerts never fire unless the number of spokes is very high.
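To make the arithmetic concrete, here is a rough sketch of how many spokes it takes to cross the threshold. It assumes the rate is summed across all spokes, that each spoke increments the counter once per interval, and that the alert compares with a strict "greater than"; the function names are hypothetical and not taken from the operator code.

```python
def aggregate_rate(spokes: int, interval_sec: int) -> float:
    """Summed per-second rate, assuming each spoke increments the
    counter once every interval_sec seconds (an assumption)."""
    return spokes / interval_sec


def spokes_needed(threshold_per_sec: int, interval_sec: int) -> int:
    """Smallest spoke count whose summed rate is strictly above the threshold."""
    # spokes / interval_sec > threshold  <=>  spokes > threshold * interval_sec
    return threshold_per_sec * interval_sec + 1


if __name__ == "__main__":
    # One increment per 5 minutes per spoke, threshold of 10 per second.
    print(spokes_needed(10, 300))
    print(aggregate_rate(100, 300))
```

Under these assumptions, a fleet of 100 spokes yields a rate of roughly 0.33 per second, far below 10, which is consistent with the 3000+ managed clusters mentioned in ACM-18001.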
These alerts should be reviewed to decide whether to fix them or drop them altogether. Ideally, the alerts would fire on error percentages (i.e., if 20% of all requests fail) instead of using a static rate threshold.
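A minimal sketch of the percentage-based idea, for comparison with the static threshold. The 20% figure comes from the example above; the function names, the choice to fire at exactly 20%, and the decision not to fire when there is no traffic are illustrative assumptions, not the operator's actual alert logic.

```python
def fires_on_static_rate(failed_rate: float, threshold: float = 10.0) -> bool:
    """Static style: fire only when the failure rate exceeds a fixed value."""
    return failed_rate > threshold


def fires_on_error_ratio(failed_rate: float, total_rate: float,
                         max_ratio: float = 0.20) -> bool:
    """Ratio style: fire when failures reach max_ratio of all requests."""
    if total_rate <= 0:
        # No traffic: nothing to compare against, so do not fire (assumption).
        return False
    return failed_rate / total_rate >= max_ratio


if __name__ == "__main__":
    # Hypothetical fleet: 10 spokes, one request per 300 s each, 3 of them failing.
    failed, total = 3 / 300, 10 / 300
    print(fires_on_static_rate(failed))         # static threshold stays silent
    print(fires_on_error_ratio(failed, total))  # ratio check fires
```

The ratio form does not depend on fleet size, so a small setup with a high failure share would be caught just as a large one would.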
Alerts affected:
- ACMMetricsCollectorFederationError
- ACMMetricsCollectorForwardRemoteWriteError
- ACMUWLMetricsCollectorFederationError
- ACMUWLMetricsCollectorForwardRemoteWriteError
- https://github.com/stolostron/multicluster-observability-operator/blob/2f4693b554510a12daa162da6018bcc7793568b4/operators/endpointmetrics/pkg/collector/metrics_collector.go#L447 (metrics collector ones on spokes)
- is related to: ACM-18001 [2.13/main] ACMRemoteWriteError only fires on setups with 3000+ managed clusters (Closed)