Type: Spike
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels:
None

Story Points:
3
Blocked:
False
Ready:
False
Epic Link:
Observability in MCO
Feature Link:
OCPSTRAT-554 - Improving error handling, propagation, collection, and disambiguation for users

Sprint:
MCO Sprint 240, MCO Sprint 243
Cost of Delay:
0
WSJF:
0.000

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

Metrics are convoluted in the MCO. We use some of the controller's code to start up the other metrics listeners. Originally only the MCD had metrics support; for ~~MCO-74~~ we added it to the MCC, but we did so more or less by duplicating the method used in the MCD rather than revisiting those decisions. There is no consistency in how metrics should look, or even in what they mean, in the MCO. We need to decide how to implement this properly. Creating a pkg/metrics package (or something similar) should help users and future team members understand our observability story. Creation and teardown of metrics should all go through the same place.

It came up in this PR: https://github.com/openshift/machine-config-operator/pull/2802 that we have duplication in metrics handler startup.
Also, we continue starting the metrics listener even after metrics registration fails.
- We need to decide whether metrics registration/startup failure should be fatal.
When thinking about centralizing, Simon pointed out that it would probably be clearer to have the callers of the start functions create a metrics registry and pass it in.
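A minimal Go sketch of what the points above could look like, assuming registration failure is treated as fatal and the caller owns the registry. This is an illustration, not the MCO's code: `Registry`, `NewRegistry`, and `StartListener` are hypothetical names, and `Registry` is a standard-library stand-in for a real metrics registry (for example a Prometheus one) that only tracks metric names.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net"
	"net/http"
	"sync"
	"time"
)

// Registry is a hypothetical stand-in for a real metrics registry.
// It only records metric names so the sketch needs no third-party packages.
type Registry struct {
	mu    sync.Mutex
	names map[string]struct{}
}

// NewRegistry is created by the caller (operator, controller, daemon)
// and passed into StartListener, rather than being a package global.
func NewRegistry() *Registry {
	return &Registry{names: map[string]struct{}{}}
}

// Register rejects empty and duplicate names.
func (r *Registry) Register(name string) error {
	if name == "" {
		return errors.New("metric name must not be empty")
	}
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.names[name]; ok {
		return fmt.Errorf("metric %q is already registered", name)
	}
	r.names[name] = struct{}{}
	return nil
}

// StartListener registers every metric first and only then starts the
// HTTP listener, so a registration failure never leaves a listener running.
// The returned function is the single teardown path for the listener.
func StartListener(reg *Registry, addr string, metrics []string) (func() error, error) {
	for _, m := range metrics {
		if err := reg.Register(m); err != nil {
			return nil, fmt.Errorf("registering metrics: %w", err)
		}
	}

	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return nil, fmt.Errorf("starting metrics listener: %w", err)
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, _ *http.Request) {
		// A real registry would render metric values here; the stand-in
		// just lists the registered names.
		reg.mu.Lock()
		defer reg.mu.Unlock()
		for name := range reg.names {
			fmt.Fprintln(w, name)
		}
	})

	srv := &http.Server{Handler: mux, ReadHeaderTimeout: 5 * time.Second}
	go func() { _ = srv.Serve(ln) }()

	stop := func() error {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		return srv.Shutdown(ctx)
	}
	return stop, nil
}
```

With this shape, each component would call `StartListener` once with its own registry and treat a non-nil error as a startup failure. Whether that failure should actually be fatal is the open question above.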
Furthermore, when it comes to testing alerts in e2e, Simon mentioned that we could query the Thanos API endpoint: https://github.com/openshift/machine-config-operator/pull/2802/#discussion_r755919982
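As a rough sketch of the e2e idea, the Go code below checks whether an alert is firing by querying a Prometheus-compatible `/api/v1/query` endpoint for the synthetic `ALERTS` series. It assumes the Thanos querier is reachable at a base URL and accepts a bearer token. The helper names (`buildQueryURL`, `parseFiring`, `AlertFiring`) and the example values are hypothetical, not taken from the PR discussion.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"path"
)

// queryResponse models only the fields of the query response that the
// check needs: the status and the list of matching series.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []json.RawMessage `json:"result"`
	} `json:"data"`
}

// buildQueryURL builds an instant-query URL that selects the ALERTS
// series for one alert name in the "firing" state.
func buildQueryURL(base, alertName string) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("parsing base URL: %w", err)
	}
	u.Path = path.Join("/", u.Path, "api/v1/query")
	q := url.Values{}
	q.Set("query", fmt.Sprintf(`ALERTS{alertname=%q,alertstate="firing"}`, alertName))
	u.RawQuery = q.Encode()
	return u.String(), nil
}

// parseFiring reports whether the query returned at least one series.
func parseFiring(body []byte) (bool, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return false, fmt.Errorf("decoding query response: %w", err)
	}
	if resp.Status != "success" {
		return false, fmt.Errorf("query returned status %q", resp.Status)
	}
	return len(resp.Data.Result) > 0, nil
}

// AlertFiring performs the query. The token is assumed to be a bearer
// token accepted by the endpoint; how it is obtained is out of scope here.
func AlertFiring(ctx context.Context, client *http.Client, base, token, alertName string) (bool, error) {
	target, err := buildQueryURL(base, alertName)
	if err != nil {
		return false, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, target, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := client.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("unexpected HTTP status %d", resp.StatusCode)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return false, err
	}
	return parseFiring(body)
}
```

An e2e test could poll `AlertFiring` until it returns true or a timeout expires, since alerts normally pass through a pending state before firing.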

Acceptance Criteria

All components of the MCO should call a general metrics handler for metric registration and listener start.
This handler should eventually be moved to the health controller to unify pool health monitoring and reporting tools.
Bring in a new health controller pod that:
- takes care of all pool health reporting telemetry
- registers and listens to metrics from the various parts of the MCO, including the operator, the controller, and the daemons, all in one place
- reports all updates and metric changes to Prometheus

is related to

MCO-846 Customizable Observability in the MCO

Closed

split to

MCO-718 Centralize metric registering and listening for all MCO-subcomponent in a unified place

Closed

MCO-719 Add the ability to turn off the metrics upon request

Closed

MCO-768 Enable current metrics to be collected in the machinestatecontroller

Closed

MCO-769 Metric syncing and reporting in the state controller

Closed

links to

MCO-134 Centralize / standardize metrics registration / handler startup and teardown Design

openshift/machine-config-operator#2802: Send alert when MCO can't safely apply updated Kubelet CA on nodes in paused pool

(2 links to)

Assignee: Ines Qian (Inactive)

Reporter: John Kyros

Votes: 0

Watchers: 2

Created: 2021/12/16 2:00 AM

Updated: 2024/08/30 3:39 PM

Resolved: 2023/10/04 8:51 PM
