Details
- Epic
- Resolution: Unresolved
- Normal
- None
- None
- None
- Enhance metrics and alerting
- To Do
- OCPSTRAT-554 - Improving error handling, propagation, collection, and disambiguation for users
- 43
- 43%
- 0
- 0
Description
Over time it has become clear that we need to enhance most of the existing MCO metrics and add more for the MCC. The MCC is tasked with watching what is going on with pools, so it makes sense to add more metrics and alerting there in particular. We have hit, and are still hitting, various hiccups with metrics. This epic aims to address those and to start adding more useful metrics and alerting to the MCO. Another aim for this epic (which we can split out) is to provide more data to help us proactively debug clusters when things go wrong.
There is a preliminary spike attached to this epic (as well as more metrics-related cards) that we need to hash out and refine before moving on. The spike may help us close, move, or obsolete some of the attached cards.
Issue Links
- is related to
  - OCPBUGS-904 Alerts from MCO are missing namespace (Closed)
  - OCPBUGS-1662 mcd_update_state metric should have a single time-series per node (Closed)
  - OCPBUGS-5497 MCDRebootError alarm disappears after 15 minutes (Closed)
- relates to
  - RFE-2647 Add/expose metrics of Machine-Config-Operator (MCO) (Accepted)
- links to