OpenShift Hive / HIVE-1892

Alert SRE (P/App) if there is an issue with the hive component



      Many internal Red Hat teams use the OCM stage environment to provision clusters. There was an incident in which the hive controller was crash-looping and no new clusters could be provisioned. Because SRE (P/App) has no alerting for this condition, we were not able to resolve the issue proactively. Because the hive-controller pod was unavailable, management of existing clusters through hive was also affected.

      // thread
      https://coreos.slack.com/archives/CCX9DB894/p1651814223330529

      Action items:
      1. Identify whether an alert rule already detects this condition in prod so that it can be reused for stage.
      2. Determine the impact of alerting SRE for stage hive.
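
      To make the condition in action item 1 concrete, here is a minimal sketch of what "crash-looping" looks like in a Pod's status, assuming the pod object is available as a dict (for example, parsed from `oc get pod -o json`). The function name, the `min_restarts` threshold, and the example pod and container names are hypothetical and are not taken from the incident or from any existing prod alert rule.

```python
from typing import Any, Dict, List


def crashlooping_containers(pod: Dict[str, Any], min_restarts: int = 3) -> List[str]:
    """Return names of containers in `pod` that appear to be crash-looping.

    `pod` is a Kubernetes Pod object as a dict (e.g. parsed from JSON output).
    `min_restarts` is a hypothetical threshold chosen for illustration.
    """
    names = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        # A crash-looping container sits in the "waiting" state with
        # reason "CrashLoopBackOff" while its restart count keeps growing.
        waiting = (cs.get("state") or {}).get("waiting") or {}
        if (
            waiting.get("reason") == "CrashLoopBackOff"
            and cs.get("restartCount", 0) >= min_restarts
        ):
            names.append(cs["name"])
    return names


# Hypothetical pod status resembling the incident; all names are illustrative.
example_pod = {
    "metadata": {"name": "hive-controllers-example", "namespace": "hive"},
    "status": {
        "containerStatuses": [
            {
                "name": "manager",
                "restartCount": 12,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
            }
        ]
    },
}

if __name__ == "__main__":
    print(crashlooping_containers(example_pod))
```

      An actual alert would more likely be expressed as a monitoring rule over pod metrics than as a script; the sketch only shows which status fields such a rule would need to reflect.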

            efried.openshift Eric Fried
            rhn-support-aabhishe Abhishek Abhishek
