-
Task
-
Resolution: Done
-
Major
-
None
-
None
-
False
-
None
-
False
Many internal Red Hat teams use the OCM stage environment to provision clusters. In a recent incident, the hive controller was in a "crashlooping" state and no new clusters could be provisioned. Because SRE (P/App) has no alerting for this condition, we were not able to resolve the issue proactively. Because the hive-controller pod was unavailable, management of existing clusters through hive was also affected.
Thread:
https://coreos.slack.com/archives/CCX9DB894/p1651814223330529
Action items:
1. Identify whether an alert rule that detects this condition already exists in prod, so that it can be reused for stage.
2. Determine the impact of alerting SRE for stage hive.