OTA-844

Add 'reason' to cluster_operator_up metric and ClusterOperatorDown alert


    • OTA 228, OTA 229
    • Customer Facing

      Add a 'reason' label to the cluster_operator_up metric and the ClusterOperatorDown alert, to give users sub-component granularity about why they're getting a critical alert.

      We should continue to avoid the cardinality hit of including the full message in the metric, because we don't want to load Prometheus down with that many time-series. For message-level granularity, users still have to follow the oc ... or web-console links from the alert description.
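      To illustrate the cardinality concern, here is a minimal Python sketch that counts the distinct label sets (and so the time series) produced when labelling by reason versus by full message. The Sample type, the series_count helper, and the operator name, reason, and message strings are all hypothetical and are not taken from any real operator.

```python
from typing import Iterable, NamedTuple


class Sample(NamedTuple):
    """One hypothetical observation of an operator's Available=False condition."""
    name: str
    reason: str
    message: str


def series_count(samples: Iterable[Sample], labels: tuple[str, ...]) -> int:
    """Count distinct label sets; each one would be a separate time series."""
    return len({tuple(getattr(s, label) for label in labels) for s in samples})


# Hypothetical data: the reason is a short, stable token, while the message
# embeds a detail (here a made-up pod name) that differs between observations.
samples = [
    Sample("example-operator", "ExampleReason", f"pod example-{i} is not ready")
    for i in range(50)
]

by_reason = series_count(samples, ("name", "reason"))
by_message = series_count(samples, ("name", "reason", "message"))
print(by_reason, by_message)
```

      In this made-up data set, labelling by reason keeps a single series, while labelling by message creates one series per distinct message.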

      A downside of this approach is that an operator's ClusterOperator Available=False reason could change rapidly. Each new reason is a new time series, so the alert's pending period would restart on every change and ClusterOperatorDown might never fire. But that seems unlikely (the reason only has to be stable for ~10 minutes before ClusterOperatorDown fires), and we can revisit this approach if it crops up in practice.
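      The interaction between a changing reason and the alert's pending period can be sketched with a simplified model. This is not the Prometheus implementation. It assumes one evaluation per minute and a 10-minute pending period, and the first_firing_minute function and the reasons "A" and "B" are hypothetical.

```python
from typing import Optional, Sequence


def first_firing_minute(
    reasons: Sequence[Optional[str]], for_minutes: int = 10
) -> Optional[int]:
    """Return the minute at which the alert would first fire, or None.

    reasons[t] is the Available=False reason at minute t, or None while the
    operator is available. Each distinct reason is modelled as its own series,
    so a change of reason restarts the pending period.
    """
    pending_reason: Optional[str] = None
    pending_since = 0
    for minute, reason in enumerate(reasons):
        if reason is None:
            # Operator is available again: nothing is pending.
            pending_reason = None
            continue
        if reason != pending_reason:
            # New series: the pending period starts over.
            pending_reason = reason
            pending_since = minute
        if minute - pending_since >= for_minutes:
            return minute
    return None


# Hypothetical timelines, one entry per minute.
stable = ["A"] * 15
flapping = ["A" if (t // 5) % 2 == 0 else "B" for t in range(30)]

print(first_firing_minute(stable))
print(first_firing_minute(flapping))
```

      Under these assumptions the stable timeline fires at minute 10, while the timeline whose reason flips every five minutes never fires.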

            trking W. Trevor King