-
Spike
-
Resolution: Done
-
Major
-
OCPSTRAT-554 - Improving error handling, propagation, collection, and disambiguation for users
-
MCO Sprint 240, MCO Sprint 243
Metrics in the MCO are convoluted. We use some of the controller's code to start up the other metrics listeners. Originally only the MCD had metrics support; for MCO-74 we added it to the MCC, but we more or less did so by duplicating the method used in the MCD rather than revisiting those decisions. There is no cohesion in how metrics should look or what they even mean in the MCO, so we need to decide how to implement this properly. Creating a pkg/metrics (or something similar) should help users and future team members understand our observability story. Creation and teardown of metrics should all go through the same place.
- It came up in this PR that we have duplication in metrics handler startup: https://github.com/openshift/machine-config-operator/pull/2802
- We also continue to start the metrics listener even after metrics registration fails.
- We need to decide whether a metrics registration/startup failure should be fatal.
- When thinking about centralizing, Simon pointed out that it would probably be clearer to have the callers of the start functions create a metrics registry and pass it around instead.
- Furthermore, when it comes to testing alerts in e2e, Simon mentioned that we could query the Thanos API endpoint: https://github.com/openshift/machine-config-operator/pull/2802/#discussion_r755919982
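As a rough illustration of the e2e idea in the last bullet, the sketch below builds an instant query against the Prometheus-compatible HTTP API (`/api/v1/query`) that Thanos Querier serves, and extracts the names of firing alerts from a response body. The function names, the base URL, and the alert name used in the usage note are hypothetical. The HTTP call itself is left out, since it would also need cluster-specific authentication (for example a bearer token).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// queryResponse models only the part of the instant-query response
// that this sketch needs.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
		} `json:"result"`
	} `json:"data"`
}

// AlertQueryURL builds an instant-query URL that selects the built-in
// ALERTS series for alertName in the "firing" state.
func AlertQueryURL(baseURL, alertName string) (string, error) {
	u, err := url.Parse(baseURL)
	if err != nil {
		return "", err
	}
	u.Path = "/api/v1/query"
	q := url.Values{}
	q.Set("query", fmt.Sprintf(`ALERTS{alertname=%q,alertstate="firing"}`, alertName))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// FiringAlerts returns the alertname label of every series in the
// response body, or an error if the query did not succeed.
func FiringAlerts(body []byte) ([]string, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("query failed with status %q", resp.Status)
	}
	names := make([]string, 0, len(resp.Data.Result))
	for _, r := range resp.Data.Result {
		names = append(names, r.Metric["alertname"])
	}
	return names, nil
}
```

An e2e test could trigger the condition under test, poll the URL returned by `AlertQueryURL` (for a placeholder such as `ExampleAlert`), and pass once `FiringAlerts` returns a non-empty list.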
Acceptance Criteria
- All components of the MCO should call a general metrics handler for metric registration and listener start.
- This handler should eventually be moved to the health controller to unify pool health monitoring and reporting tools.
- Bring in a new health controller pod that:
  - takes care of all pool health reporting telemetry
  - registers and listens to metrics from the various parts of the MCO, including the operator, the controller, and the daemons, all in one place
  - reports all updates and metric changes to Prometheus
- is related to
  - MCO-846 Customizable Observability in the MCO (Closed)
- split to
  - MCO-718 Centralize metric registering and listening for all MCO-subcomponent in a unified place (Closed)
  - MCO-719 Add the ability to turn off the metrics upon request (Closed)
  - MCO-768 Enable current metrics to be collected in the machinestatecontroller (Closed)
  - MCO-769 Metric syncing and reporting in the state controller (Closed)
- links to