OpenShift Hive / HIVE-1857

Identify individual clusters failing cluster sync


      As an SREP engineer, I need to know when an individual cluster is failing clustersync. In just the last few weeks we have had several instances where a failed sync caused issues on the cluster, including alerting rules that were not updated and operator updates that were not applied (which prevented bugfixes).

      We need a way to identify these clusters on a cluster-by-cluster basis. Alerting on failing selectorSyncSets does not surface which cluster is failing, which complicates alerting and troubleshooting and will not scale. I'd argue we're well past the scale point already.

      This is a pretty high priority need for SREP at the moment.

      Done Criteria:

      • SREP has an endpoint or other mechanism to automatically query and alert when an individual cluster is failing clustersync
      • If this is an alert provided by Hive, a la cluster provisioning delays, it should perhaps have a threshold of 1 day to allow for transient sync failures.
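
      To make the done criteria concrete, here is a rough sketch of the per-cluster check, not a proposal for how Hive should implement it. It assumes some source can report, for each cluster, the time since which its sync has been failing. The `ClusterSyncStatus` record, its `failing_since` field, and the `clusters_failing_sync` helper are all hypothetical names made up for illustration; the 1 day threshold comes from the criterion above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional


@dataclass
class ClusterSyncStatus:
    """Hypothetical per-cluster record; not a real Hive type."""
    cluster: str
    # Time since which sync has been failing; None when sync is healthy.
    failing_since: Optional[datetime]


def clusters_failing_sync(
    statuses: Iterable[ClusterSyncStatus],
    now: datetime,
    threshold: timedelta = timedelta(days=1),
) -> List[str]:
    """Return names of clusters whose sync has been failing for at least `threshold`.

    Failures shorter than the threshold are treated as transient and ignored.
    """
    return sorted(
        s.cluster
        for s in statuses
        if s.failing_since is not None and now - s.failing_since >= threshold
    )


if __name__ == "__main__":
    now = datetime(2024, 1, 10, tzinfo=timezone.utc)
    statuses = [
        ClusterSyncStatus("cluster-a", now - timedelta(days=2)),   # failing for 2 days
        ClusterSyncStatus("cluster-b", now - timedelta(hours=3)),  # transient failure
        ClusterSyncStatus("cluster-c", None),                      # healthy
    ]
    for name in clusters_failing_sync(statuses, now):
        print(f"ALERT: {name} has been failing clustersync for over 1 day")
```

      The point of the sketch is that the result names individual clusters, which is what a selectorSyncSet-level failure signal does not give us.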

      I am happy to discuss, answer questions and/or work with whoever is taking this task on.

            sumehta Suhani Mehta
            chcollin Chris Collins
            Mingxia Huang Mingxia Huang