Type: Feature
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: OS
Labels:

Work Type: BU Product Work
Blocked: False
Ready: False
Parent Link: OCPPLAN-6665 Observability Experience
Hierarchy Progress Bar: 25% To Do, 0% In Progress, 75% Done
Risk Score: 0

To be broken into one feature epic and one spike:

Feature: error type disambiguation and error propagation into the operator status
Spike: general improvements to make errors more actionable for the end user

The MCO today has multiple layers of errors. An error message can appear in four places, from highest to lowest:

The MCO operator status
The MachineConfigPool (MCPool) status
The MachineConfigController/MachineConfigDaemon pod logs
The journal logs on the node

Error propagation between these layers is generally not 1-to-1. The operator status usually reflects the pool status, but the full error from the controller or daemon does not bubble up to the pool or operator, and errors in the journal logs generally do not bubble up at all. This is confusing for customers and admins who work with the MCO without a full understanding of its internal mechanics:

The real error is hard to find
The error message is often generic and ambiguous
The solution/workaround is not clear at all

Take “unexpected on-disk state” as an example. It can be caused by any number of the following:

An incomplete update happened, and something rebooted the node
The node upgrade succeeded up to the rpm-ostree step, which failed and atomically rolled back
The user modified something manually
Another operator modified something manually
Some other service/network manager overwrote something MCO writes

And so on.
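
As a rough illustration of what disambiguating this one message could look like, the sketch below maps a few pieces of node evidence to distinct causes. All names here (DriftCause, NodeEvidence, classify, describe) and the evidence fields are hypothetical and are not part of the MCO. The sketch also cannot tell a user edit from an edit by another operator or service, so those three cases collapse into one category.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class DriftCause(Enum):
    """Hypothetical categories for an "unexpected on-disk state" error."""
    INTERRUPTED_UPDATE = "an update was interrupted by a reboot"
    OS_ROLLBACK = "rpm-ostree failed and rolled back"
    EXTERNAL_MODIFICATION = "a managed file was changed outside the MCO"
    UNKNOWN = "cause could not be determined"


@dataclass(frozen=True)
class NodeEvidence:
    """Illustrative facts a daemon might gather; every field is an assumption."""
    update_in_progress_marker: bool      # an unfinished update left a marker
    rpm_ostree_failed_last_boot: bool    # previous-boot logs show a failure
    changed_files: Tuple[str, ...]       # managed files whose content differs


def classify(evidence: NodeEvidence) -> DriftCause:
    # Check the most specific signals first so that a rollback is not
    # misreported as a generic file mismatch.
    if evidence.rpm_ostree_failed_last_boot:
        return DriftCause.OS_ROLLBACK
    if evidence.update_in_progress_marker:
        return DriftCause.INTERRUPTED_UPDATE
    if evidence.changed_files:
        return DriftCause.EXTERNAL_MODIFICATION
    return DriftCause.UNKNOWN


def describe(evidence: NodeEvidence) -> str:
    """Build a message that names the cause instead of a generic error."""
    cause = classify(evidence)
    message = f"unexpected on-disk state: {cause.value}"
    if cause is DriftCause.EXTERNAL_MODIFICATION:
        message += " (" + ", ".join(evidence.changed_files) + ")"
    return message
```

The ordering of the checks is a design choice in this sketch: a failed rpm-ostree transaction usually also leaves files that differ, so it is tested before the generic file comparison.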

Because error cases are wide and varied, each individual error state could be improved in many ways. This epic proposes targeted improvements to error messaging and propagation specifically. The goals are:

Disambiguating different error cases that share the same message
Adding more error catching, including journal logs and rpm-ostree errors
Propagating full error messages up the stack, clearly, all the way to the operator status
Adding actionable fix/information messages alongside the error message
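
One way to picture the propagation goal is an error value that keeps its cause as each layer wraps it, so the top-level status can still show the root error and a suggested next step. This is only a sketch: LayeredError, its fields, and the layer names used as strings are assumptions made for illustration, not the MCO's actual types.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LayeredError:
    """An error that records which layer reported it and what caused it."""
    layer: str                               # e.g. "journal", "daemon", "pool"
    message: str
    cause: Optional["LayeredError"] = None   # the lower-layer error, if any
    hint: Optional[str] = None               # optional suggested next step

    def wrap(self, layer: str, message: str) -> "LayeredError":
        """Return a higher-layer error that keeps this one as its cause."""
        return LayeredError(layer=layer, message=message,
                            cause=self, hint=self.hint)

    def chain(self) -> List["LayeredError"]:
        """List the errors from the highest layer down to the root cause."""
        errors, current = [], self
        while current is not None:
            errors.append(current)
            current = current.cause
        return errors

    def render(self) -> str:
        """Format the whole chain, followed by the hint when one exists."""
        lines = [f"{e.layer}: {e.message}" for e in self.chain()]
        text = "\n  caused by ".join(lines)
        if self.hint:
            text += f"\n  hint: {self.hint}"
        return text
```

For example, wrapping a journal-level error three times (daemon, pool, operator) and calling render() on the result yields the operator message first, then each lower layer on its own "caused by" line, with the hint last.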

A side objective is observability, including reporting items such as the following all the way up to the operator status:

The status of all pools
The current status of the update/upgrade for each pool
What the update/upgrade is blocked on
How to unblock the upgrade
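
The four items above could be carried in a small per-pool record and flattened into status lines. The sketch below assumes a hypothetical PoolStatus shape and summarize helper; none of these names or fields come from the MCO API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoolStatus:
    """Hypothetical per-pool summary; field names are illustrative only."""
    name: str
    total: int                          # machines in the pool
    updated: int                        # machines already on the target config
    blocked_on: Optional[str] = None    # what the update is waiting for
    unblock: Optional[str] = None       # suggested way to unblock it


def summarize(pools: List[PoolStatus]) -> List[str]:
    """Produce one human-readable line per pool for a top-level status."""
    lines = []
    for pool in pools:
        line = f"{pool.name}: {pool.updated}/{pool.total} machines updated"
        if pool.blocked_on:
            line += f"; blocked on {pool.blocked_on}"
            if pool.unblock:
                line += f"; to unblock: {pool.unblock}"
        lines.append(line)
    return lines
```

A pool with no blocker produces only the progress part of the line, so healthy pools stay short and blocked pools stand out.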

Approaches can include:

Better error messaging starting with common error cases
Disambiguating config mismatch
Capturing rpm-ostree logs from the previous boot when an osImageURL mismatch error occurs
Propagating the full daemon error message back to the pool/operator status
Adding a new field to the MCO operator spec that attempts to suggest fixes, or where to look next, when an error occurs
Adding better alerting messages for MCO errors
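
For the rpm-ostree log capture approach, a minimal sketch might shell out to journalctl for the previous boot. It assumes the relevant messages are logged under the rpm-ostreed.service unit and that the previous boot's journal is still retained on the node. The helper names are hypothetical, and this is not how the MCO daemon currently gathers logs.

```python
import subprocess
from typing import List


def previous_boot_journal_cmd(unit: str = "rpm-ostreed.service") -> List[str]:
    """Build a journalctl command for one unit's logs from the previous boot."""
    # "-b -1" selects the boot before the current one; "-u" filters by unit.
    return ["journalctl", "--no-pager", "-b", "-1", "-u", unit]


def tail(lines: List[str], count: int) -> List[str]:
    """Keep only the last `count` lines, which are most likely to hold the failure."""
    return lines[-count:] if count > 0 else []


def capture_previous_boot_log(max_lines: int = 20) -> List[str]:
    """Run journalctl and return the tail of its output, or [] if unavailable."""
    try:
        result = subprocess.run(previous_boot_journal_cmd(),
                                capture_output=True, text=True, check=False)
    except FileNotFoundError:
        # journalctl is not installed in this environment.
        return []
    return tail(result.stdout.splitlines(), max_lines)
```

The captured lines could then be attached to the daemon's error before it is reported upward, so that an osImageURL mismatch carries the reason the OS update did not apply.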

relates to

RFE-2647 Add/expose metrics of Machine-Config-Operator (MCO)

Accepted

Assignee: Mark Russell

Reporter: Yu Qi Zhang

QA Contact: Cheng Zhang

Doc Contact: Matthew Werner

Product Operations Engineering Contact: Derrick Ornelas

Votes: 1

Watchers: 10

Created: 2021/09/27 9:34 PM

Updated: 2024/09/04 8:46 PM
