-
Story
-
Resolution: Done
Cluster sync failures can have many different root causes: some are actionable, some are non-actionable, and some are waiting for a fix to take effect. We have had multiple cases where an operator update failed and caused thousands of SyncSets to fail. When every alert carries the same message, this floods our primary shift and effectively hides the actionable alerts.
We need a quick way to tell these alerts apart and to identify the ones that should be silenced in PagerDuty. A label could be added to the metric, but an unbounded label value (for example, the raw failure message) raises cardinality issues.
Done Criteria:
- A way to identify the cluster sync root cause from the alert is found or implemented
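
One possible way to address the cardinality concern above is to map free-form failure messages onto a small, fixed set of reason values before they are used as a label. The sketch below is only an illustration: the reason names, the regular expressions, and the sample messages are all hypothetical and are not taken from real cluster sync output.

```python
import re
from collections import Counter

# Hypothetical reason buckets. The patterns are illustrative guesses,
# not real cluster sync failure messages. Order matters: first match wins.
REASON_PATTERNS = [
    ("operator_update", re.compile(r"operator|subscription", re.IGNORECASE)),
    ("api_unreachable", re.compile(r"timeout|connection refused|unreachable", re.IGNORECASE)),
    ("rejected_resource", re.compile(r"invalid|forbidden|denied", re.IGNORECASE)),
]
FALLBACK_REASON = "unknown"


def classify_failure(message: str) -> str:
    """Return one reason from a bounded set for a free-form failure message."""
    for reason, pattern in REASON_PATTERNS:
        if pattern.search(message):
            return reason
    return FALLBACK_REASON


def count_by_reason(messages):
    """Count failures per reason; the number of keys is bounded by the bucket list."""
    return Counter(classify_failure(m) for m in messages)


if __name__ == "__main__":
    sample = [
        "failed to apply Subscription for operator foo",
        "dial tcp: connection refused",
        "some message nobody anticipated",
    ]
    for reason, count in sorted(count_by_reason(sample).items()):
        print(reason, count)
```

Because the label can only take one of the listed reasons plus the fallback, the number of series stays bounded regardless of how many SyncSets fail, and alerts could be routed or silenced per reason.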