Type: Epic
Resolution: Done
Priority: Normal
Fix Version/s: 4.15.0
Affects Version/s: None
Labels:

Epic Name:
Observability in MCO
Epic Status:
To Do
Feature Link:
OCPSTRAT-554 - Improving error handling, propagation, collection, and disambiguation for users
Parent Link:
OCPSTRAT-554 - Improving error handling, propagation, collection, and disambiguation for users
Hierarchy Progress Bar:

5% To Do, 0% In Progress, 95% Done
Target Version:

openshift-4.15

Cost of Delay:
0
WSJF:
0

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

It became clear over time that we need to enhance most of the existing MCO metrics and add more metrics related to the MCC. The MCC is tasked with watching what is going on with pools, so it makes sense to add more metrics and alerting there in particular. We have run into, and are still running into, various hiccups with metrics. This epic aims to address those and to start adding more useful metrics and alerting to the MCO. Another aim for this epic (though we can split it out) is to provide more data to help us proactively debug clusters when things go wrong.

After spiking, the metric enhancement work is split as follows:

Expose more pool health metrics, which includes: (1) exposing metrics in the MCD to enable a node watcher; (2) exposing metrics in the MCO to enable an MCP watcher; (3) exposing metrics in the MCC, especially for the MCC sub-controllers, to enable a comprehensive watcher on nodes, pools, and configs
oauth-proxy to kube-rbac-proxy migration for metric backend
Metric infrastructure re-org to be ready for customization and CRD consumption
- This part of the work was originally prioritized and was under construction with a design focused on metric centralization: with the introduction of the state controller in ~~MCO-452~~, the MCO will use the state controller as a centralized center for registering, listening to, and reporting metrics. All the other sub-components of the MCO will report to the state controller when there is an update. By bringing in this unified infrastructure, the MCO gives the user a single entry point for changing metric configuration all at once. [USE CASE: the user can pass in a CRD with all the metrics they want to turn on; the state controller will then interpret and sync the customer-defined requirements and enable the corresponding metrics accordingly]
- However, the implementation of this work is severely delayed due to (1) the redesign of the state controller (see updates for ~~MCO-690~~) and (2) the redesign of the message bus between the MCO sub-components and the state controller (see updates for ~~MCO-751~~)
- It is no longer in scope for 4.15 and will be tracked in ~~MCO-846~~ for 4.16
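To make the centralization idea above concrete, here is a minimal sketch of the reporting flow, written in Python for brevity rather than in the MCO's own codebase. Everything in it is an assumption for illustration: the `MetricCenter` class, its `sync`, `report`, and `snapshot` methods, and the metric name `example_pool_metric` are hypothetical, and the user-supplied CRD is reduced to a plain list of metric names. It does not reflect the actual state controller design tracked in the linked issues.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class MetricCenter:
    """Hypothetical stand-in for the state controller's metric role."""

    # Names of the metrics the user asked to turn on.
    enabled: Set[str] = field(default_factory=set)
    # Latest reported value, keyed by (metric name, reporting component).
    values: Dict[Tuple[str, str], float] = field(default_factory=dict)

    def sync(self, requested: Iterable[str]) -> None:
        """Apply the user-requested metric list (a stand-in for the CRD).

        Values belonging to metrics that are no longer enabled are dropped.
        """
        self.enabled = set(requested)
        self.values = {
            key: value
            for key, value in self.values.items()
            if key[0] in self.enabled
        }

    def report(self, component: str, metric: str, value: float) -> bool:
        """Called by a sub-component when it has an update.

        Returns True if the update was recorded, False if the metric
        is not enabled and the update was ignored.
        """
        if metric not in self.enabled:
            return False
        self.values[(metric, component)] = value
        return True

    def snapshot(self) -> Dict[Tuple[str, str], float]:
        """Return a copy of the currently recorded values."""
        return dict(self.values)


if __name__ == "__main__":
    center = MetricCenter()
    center.sync(["mcd_config_drift"])
    center.report("node-a", "mcd_config_drift", 1.0)     # recorded
    center.report("node-a", "example_pool_metric", 1.0)  # ignored, not enabled
    print(center.snapshot())
```

The point of the sketch is only the shape of the interaction: sub-components push updates to one place, and a single user-supplied list decides which of those updates are kept and exposed.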

is cloned by

MCO-846 Customizable Observability in the MCO

Closed

is related to

OCPBUGS-24003 mcd_config_drift not working properly

ASSIGNED

OCPBUGS-904 Alerts from MCO are missing namespace

Closed

OCPBUGS-1662 mcd_update_state metric should have a single time-series per node

Closed

OCPBUGS-5497 MCDRebootError alarm disappears after 15 minutes

Closed

relates to

RFE-2647 Add/expose metrics of Machine-Config-Operator (MCO)

Closed

links to

all the alert rules' annotations "summary" and "description" should comply with the OpenShift alerting guidelines

https://bugzilla.redhat.com/show_bug.cgi?id=2010371

openshift/machine-config-operator#3406: Bug 1853264: Fix unbound cardinality for MCDRebootErr and MCDPivotErr


Assignee: Ines Qian (Inactive)

Reporter: Antonio Murdaca

QA Contact: Sergio Regidor de la Rosa

Votes: 1

Watchers: 12

Created: 2020/08/26 10:37 AM

Updated: 2024/03/22 5:57 PM

Resolved: 2024/03/22 5:57 PM
