-
Bug
-
Resolution: Done
-
Normal
-
None
-
ACM 2.13.0
-
Quality / Stability / Reliability
-
3
-
False
-
-
False
-
-
-
-
Observability Sprint 39, Observability Sprint 40
-
Important
-
None
Context
Many clients manage the cluster-monitoring-config through GitOps and apply updates that conflict with what the endpoint-operator adds. This triggers conflicting reconcile loops on both ends. Prometheus is restarted on every configuration change, which disrupts monitoring and alerting.
What
Define a way to surface these reconcile loops to the user.
Possible solutions:
- Add a platform serviceMonitor for the endpoint-monitor with an alert rule that fires in this case. This is possibly the best solution, but it might not work if Prometheus is constantly restarting and unable to scrape metrics.
- Detect these loops from inside the operator and degrade the addon state with a relevant message.
Acceptance criteria:
- Implement a solution that detects these loops from inside the operator and degrades the addon state with a relevant message
- Write troubleshooting steps for what to do when in this situation (ACM docs or KCS)