  OpenShift Hive / HIVE-2286

Identify a reason in the ClusterSyncFailingSeconds metric


      ClusterSync failures can have many different root causes: some are actionable, some are not, and some are only waiting for a fix to take effect. We have had multiple cases where a failed operator update caused thousands of SyncSets to fail. Because all of these alerts carry the same message, such an event can flood our primary shift and effectively hide the actionable alerts.

       

      We need a quick way to distinguish the alerts that need action from the ones that should be silenced in PagerDuty. A label carrying the failure reason could be added to the metric, but this raises cardinality issues.
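
One possible way to limit the cardinality cost of a reason label is to map free-form failure messages onto a small, fixed set of reason values before the label is set. The Go sketch below illustrates that idea only. The reason names, the matching substrings, and the `classifyFailure` helper are all hypothetical and are not taken from Hive; a real implementation would need rules derived from the failures actually observed.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical, closed set of reason values. Because the set is fixed,
// a "reason" label built from it adds at most len(set) values per series.
const (
	ReasonClusterUnreachable = "ClusterUnreachable"
	ReasonOperatorUpdate     = "OperatorUpdateFailed"
	ReasonResourceApply      = "ResourceApplyFailed"
	ReasonUnknown            = "Unknown"
)

// rule maps a lowercase substring of a failure message to a reason value.
type rule struct {
	substr string
	reason string
}

// reasonRules is evaluated in order; the first match wins.
// The substrings are illustrative assumptions, not real Hive messages.
var reasonRules = []rule{
	{"connection refused", ReasonClusterUnreachable},
	{"i/o timeout", ReasonClusterUnreachable},
	{"clusteroperator", ReasonOperatorUpdate},
	{"failed to apply", ReasonResourceApply},
}

// classifyFailure reduces an arbitrary failure message to one bounded reason.
// Anything unrecognized falls into ReasonUnknown so the label never grows.
func classifyFailure(message string) string {
	m := strings.ToLower(message)
	for _, r := range reasonRules {
		if strings.Contains(m, r.substr) {
			return r.reason
		}
	}
	return ReasonUnknown
}

func main() {
	messages := []string{
		"dial tcp 10.0.0.1:6443: connection refused",
		"ClusterOperator degraded during update",
		"Failed to apply SyncSet resource",
		"some new, unseen error",
	}
	for _, msg := range messages {
		fmt.Printf("%-45q -> %s\n", msg, classifyFailure(msg))
	}
}
```

With a bounded reason like this, an alert rule or a PagerDuty routing rule could in principle group or silence by reason instead of treating every failing ClusterSync as the same alert. Whether the extra series are acceptable would still need to be checked against the number of clusters being tracked.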

       

      Done Criteria:

      • A way to identify the cluster sync root cause from the alert is found or implemented

            sumehta Suhani Mehta
            zmird.openshift Zakaria Mird
            Jianping Shu Jianping Shu