- Story
- Resolution: Done
- Normal
- None
- openshift-4.13
- None
- 3
- False
- None
- False
- OTA 228, OTA 229
- Customer Facing
The goal is to give users sub-component granularity about why they are getting a critical alert.
We should continue to avoid the cardinality hit of including the full message in the metric, because we don't want to load Prometheus down with that many time series. For message-level granularity, users still have to follow the oc ... or web-console links in the alert description.
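As a rough illustration of the cardinality concern, the sketch below counts how many distinct label sets (and so how many time series) a handful of samples would produce with and without a message label. The label names `name`, `reason`, and `message`, the operator name, and all sample values are hypothetical and chosen only for illustration; they are not taken from the actual metric.

```python
# Hypothetical samples for one operator reporting Available=False.
# Each tuple is (name, reason, message); every value here is invented.
samples = [
    ("example-operator", "ExampleReason", "pod example-1 failed at 10:00:01"),
    ("example-operator", "ExampleReason", "pod example-2 failed at 10:00:31"),
    ("example-operator", "ExampleReason", "pod example-3 failed at 10:01:02"),
]

FIELD_INDEX = {"name": 0, "reason": 1, "message": 2}


def series_count(samples, label_names):
    """Count distinct label sets, i.e. the number of separate time series."""
    label_sets = {
        tuple(sample[FIELD_INDEX[label]] for label in label_names)
        for sample in samples
    }
    return len(label_sets)


with_reason = series_count(samples, ["name", "reason"])
with_message = series_count(samples, ["name", "reason", "message"])
print(with_reason, with_message)
```

In this toy data the reason stays the same while the message text changes on every sample, so the reason-only label set collapses to a single series while the message label produces one series per sample.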
A downside of this approach is that an operator could have a rapidly changing ClusterOperator Available=False reason. But that seems unlikely to be a problem (the reason only has to be stable for ~10 minutes before ClusterOperatorDown fires), and we can revisit this approach if it crops up in practice.
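The stability concern can be sketched with a simplified model, under the assumption that a change of reason restarts the alert's pending window. This is not how Prometheus is implemented; `would_fire`, `FOR_MINUTES`, and the reason strings are hypothetical names used only for this example, and the 10-minute value approximates the "~10 minutes" mentioned above.

```python
FOR_MINUTES = 10  # assumption: approximates the "~10 minutes" pending window


def would_fire(observations, for_minutes=FOR_MINUTES):
    """Return True if a single reason holds for at least for_minutes.

    observations is a list of (minute, reason) pairs, sorted by minute,
    all taken while the operator is Available=False. A change of reason
    is modeled as restarting the pending window.
    """
    current_reason = None
    window_start = None
    for minute, reason in observations:
        if reason != current_reason:
            # New reason: treat it as a new series and restart the window.
            current_reason = reason
            window_start = minute
        if minute - window_start >= for_minutes:
            return True
    return False


# One reason held for 11 minutes.
stable = [(m, "ReasonA") for m in range(12)]
# Reason alternates every 3 minutes.
flapping = [(m, "ReasonA" if (m // 3) % 2 == 0 else "ReasonB") for m in range(12)]

print(would_fire(stable), would_fire(flapping))
```

Under this model the stable operator fires and the flapping one never does, which is the scenario the paragraph above flags as unlikely but worth revisiting if it shows up in practice.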