-
Epic
-
Resolution: Done
-
Normal
-
None
-
Observability in MCO
-
To Do
-
OCPSTRAT-554 - Improving error handling, propagation, collection, and disambiguation for users
-
5% To Do, 5% In Progress, 90% Done
-
0
-
0
Over time it became clear that we need to enhance most of the existing MCO metrics and add more for the MCC. The MCC is tasked with watching what is going on with pools, so it makes sense to add more metrics and alerting there in particular. We have hit, and are still hitting, various hiccups with metrics. This epic aims to address those and to start adding more useful metrics and alerting to the MCO. Another aim for this epic (which we can split out) is to provide more data to help us proactively debug clusters when things go wrong.
After spiking, the metric enhancement work is split as follows:
- Expose more pool health metrics, which includes (1) exposing metrics in the MCD to enable a node watcher, (2) exposing metrics in the MCO to enable an MCP watcher, and (3) exposing metrics in the MCC, especially for the MCC sub-controllers, to enable a comprehensive watcher on nodes, pools, and configs
- oauth-proxy to kube-rbac-proxy migration for metric backend
- Metric infrastructure re-org to be ready for customization and CRD consumption
  - This part of the work was originally prioritized and is under construction with a design focused on metric centralization: with the introduction of the state controller in MCO-452, the MCO will use the state controller as a central place for registering, listening to, and reporting metrics. All other MCO sub-components will report to the state controller when there is an update. This unified infrastructure gives the user a single entry point for configuring all metrics at once. [USE CASE: the user passes in a CRD listing all the metrics they want to turn on; the state controller then interprets and syncs the customer-defined requirements and enables the corresponding metrics.]
  - However, the implementation has been severely delayed by (1) the redesign of the state controller (see updates for MCO-690) and (2) the redesign of the message bus between the MCO sub-components and the state controller (see updates for MCO-751).
  - It is no longer in scope for 4.15 and will be tracked in MCO-846 for 4.16.
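The centralized design described above can be illustrated with a small, language-neutral sketch. It is written in Python for brevity and is not the MCO implementation: the `StateController` class, its method names, and the `enabledMetrics` field are all hypothetical, and a plain dictionary stands in for the user-supplied CRD. Only the metric names `mcd_update_state` and `mcd_config_drift` come from this ticket.

```python
from dataclasses import dataclass, field


@dataclass
class StateController:
    """Hypothetical central metric registry; not the MCO's actual API."""

    registered: dict = field(default_factory=dict)  # metric name -> owning component
    enabled: set = field(default_factory=set)       # metrics switched on by config
    values: dict = field(default_factory=dict)      # metric name -> last reported value

    def register(self, name, component):
        """A sub-component announces a metric it is able to report."""
        self.registered[name] = component

    def apply_config(self, spec):
        """Enable exactly the registered metrics listed in the spec.

        `spec` is a plain dict standing in for a user-supplied custom
        resource (assumed shape). Returns the requested names that no
        component has registered.
        """
        wanted = set(spec.get("enabledMetrics", []))
        known = set(self.registered)
        self.enabled = wanted & known
        # Forget values of metrics that are no longer enabled.
        self.values = {k: v for k, v in self.values.items() if k in self.enabled}
        return sorted(wanted - known)

    def report(self, name, value):
        """Record an update from a sub-component; ignored unless enabled."""
        if name not in self.enabled:
            return False
        self.values[name] = value
        return True


controller = StateController()
controller.register("mcd_update_state", "MCD")
controller.register("mcd_config_drift", "MCD")

# "no_such_metric" is a made-up name used to show how unknown requests surface.
unknown = controller.apply_config(
    {"enabledMetrics": ["mcd_update_state", "no_such_metric"]}
)
accepted = controller.report("mcd_update_state", 1)  # enabled, so recorded
ignored = controller.report("mcd_config_drift", 1)   # registered but not enabled
```

In this sketch the sub-components only call `register` and `report`, so which metrics are actually recorded is decided in one place by the configuration passed to `apply_config`.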
- is cloned by
-
MCO-846 Customizable Observability in the MCO
- Closed
- is related to
-
OCPBUGS-24003 mcd_config_drift not working properly
- ASSIGNED
-
OCPBUGS-904 Alerts from MCO are missing namespace
- Closed
-
OCPBUGS-1662 mcd_update_state metric should have a single time-series per node
- Closed
-
OCPBUGS-5497 MCDRebootError alarm disappears after 15 minutes
- Closed
- relates to
-
RFE-2647 Add/expose metrics of Machine-Config-Operator (MCO)
- Accepted
- links to