OpenShift Hive / HIVE-2286

Identify a reason in the ClusterSyncFailingSeconds metric


      ClusterSync failures can have many different root causes: some are actionable, some are not, and some are simply waiting for a fix to take effect. We have had multiple cases where an operator update failed and caused thousands of SyncSets to fail. When all of the resulting alerts carry the same message, they flood our primary shift and effectively hide the actionable alerts.

       

      We need a quick way to tell these alerts apart and to identify the ones that should be silenced in PagerDuty. A reason label could be added to the metric, but this raises cardinality concerns.
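
      One way to keep the cardinality concern in check is to map free-form failure messages onto a small, closed set of reason values before they are used as a label. The sketch below only illustrates that idea; the reason names, the match strings, and the functions classify_failure and count_by_reason are hypothetical and are not taken from Hive.

```python
from collections import Counter

# Hypothetical, closed set of reason values. Because the set is fixed,
# a label built from it can take at most len(REASON_PATTERNS) + 1 values.
REASON_PATTERNS = [
    ("Unauthorized", ("unauthorized", "forbidden")),
    ("ClusterUnreachable", ("connection refused", "i/o timeout", "no such host")),
    ("ResourceConflict", ("already exists", "conflict")),
    ("InvalidResource", ("invalid", "unable to decode")),
]
UNKNOWN = "Unknown"


def classify_failure(message: str) -> str:
    """Map a free-form failure message to one of the fixed reason values."""
    text = message.lower()
    for reason, needles in REASON_PATTERNS:
        if any(needle in text for needle in needles):
            return reason
    # Anything unmatched falls into a single catch-all bucket, so new
    # error strings never create new label values.
    return UNKNOWN


def count_by_reason(messages):
    """Count failures per reason, e.g. to spot one cause behind many alerts."""
    return Counter(classify_failure(m) for m in messages)
```

      Grouping by such a reason would let a single failed operator update show up as one large bucket rather than thousands of identical-looking alerts, which is the case that needs to be silenced quickly.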

       

      Done Criteria:

      • A way to identify the cluster sync root cause from the alert is found or implemented

              sumehta Suhani Mehta
              zmird.openshift Zakaria Mird
              Jianping Shu Jianping Shu
