-
Story
-
Resolution: Done
-
Major
-
None
-
None
As an SREP engineer, I need to know when an individual cluster is failing clustersync. We have had several instances in just the last few weeks where failing to sync has caused issues on the cluster, including failing to update alerting rules, operator updates (preventing bugfixes), etc.
We need a way to identify these clusters on a cluster-by-cluster basis, as opposed to failing selectorSyncSets, which do not surface the failing cluster and therefore complicate the alerting and troubleshooting process, and will not scale. I'd argue we're well past the scale point already.
This is a pretty high priority need for SREP at the moment.
Done Criteria:
- SREP has an endpoint or other mechanism to automatically query and alert when an individual cluster is failing clustersync
- If this is an alert provided by hive a la cluster provisioning delays, it should perhaps have a threshold of 1 day, to allow for transient sync failures.
I am happy to discuss, answer questions and/or work with whoever is taking this task on.