Type: Epic
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels:
None

Epic Name:
Actionable Error Messaging
Blocked:
False
Ready:
False
Epic Status:
In Progress
Flagged:

Impediment
Hierarchy Progress Bar:

38% To Do, 0% In Progress, 62% Done

WSJF:
0

SFDC Cases Links:
SFDC Cases Open:
SFDC Cases Counter:

Error propagation in the MCO is generally not 1-to-1. The operator status usually captures the pool status, but the full error from the controller or daemon does not bubble up to the pool or operator, and errors in the journal logs generally do not bubble up at all. This is very confusing for customers and admins who work with the MCO without a full understanding of its internal mechanics:

The real error is hard to find
The error message is often generic and ambiguous
The solution/workaround is not clear at all

Taking “unexpected on-disk state” as an example, this error can be caused by any number of the following:

An incomplete update happened, and something rebooted the node
The node upgrade was successful until rpm-ostree, which failed and atomically rolled back
The user modified something manually
Another operator modified something manually
Some other service/network manager overwrote something MCO writes

Etc. etc.
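
To illustrate how one generic message can hide several distinct causes, here is a minimal Python sketch that maps evidence gathered on a node to a specific cause. Everything in it is hypothetical and for illustration only: the `MismatchCause` names, the `NodeEvidence` fields, and `classify_on_disk_mismatch` are assumptions, not MCO code, and the real signals available to the daemon may differ.

```python
from dataclasses import dataclass
from enum import Enum


class MismatchCause(Enum):
    """Hypothetical, more specific causes behind "unexpected on-disk state"."""
    INTERRUPTED_UPDATE = "an incomplete update was interrupted by a reboot"
    RPM_OSTREE_ROLLBACK = "rpm-ostree failed and rolled back"
    EXTERNAL_MODIFICATION = "a managed file was changed outside the MCO"
    UNKNOWN = "cause could not be determined"


@dataclass
class NodeEvidence:
    # Hypothetical signals a daemon might collect; not real MCO fields.
    update_in_progress_marker: bool = False
    rpm_ostree_failed_last_boot: bool = False
    managed_file_differs: bool = False


def classify_on_disk_mismatch(evidence: NodeEvidence) -> MismatchCause:
    """Pick the most specific cause the evidence supports."""
    if evidence.rpm_ostree_failed_last_boot:
        return MismatchCause.RPM_OSTREE_ROLLBACK
    if evidence.update_in_progress_marker:
        return MismatchCause.INTERRUPTED_UPDATE
    if evidence.managed_file_differs:
        return MismatchCause.EXTERNAL_MODIFICATION
    return MismatchCause.UNKNOWN
```

The ordering of the checks is a design choice in this sketch: a failed rpm-ostree run is treated as the strongest signal, and a plain file difference is only reported when nothing more specific is known.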

Since error cases are wide and varied, there are many improvements we could make for each individual error state. This epic proposes targeted improvements to error messaging and propagation specifically. The goals are:

Disambiguating different error cases that share the same message
Adding more error catching, including journal logs and rpm-ostree errors
Propagating full error messages further up the stack, up to the operator status in a clear manner
Adding actionable fix/information messages alongside the error message
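
The propagation goals above can be sketched as an error that each layer wraps rather than replaces, so the root cause and any actionable hint survive up to the operator status. This is a minimal Python illustration under stated assumptions: `LayeredError`, `render_status`, and the layer names are hypothetical and do not reflect how the MCO currently structures its errors.

```python
from dataclasses import dataclass
from typing import Iterator, Optional


@dataclass
class LayeredError:
    """Hypothetical error record that keeps its cause when wrapped."""
    layer: str                       # e.g. "daemon", "pool", "operator"
    message: str
    hint: Optional[str] = None       # actionable advice, if any
    cause: Optional["LayeredError"] = None

    def wrap(self, layer: str, message: str) -> "LayeredError":
        """Return a new error for an outer layer with this one as its cause."""
        return LayeredError(layer=layer, message=message, cause=self)

    def chain(self) -> Iterator["LayeredError"]:
        """Yield errors from the outermost layer down to the root cause."""
        current: Optional[LayeredError] = self
        while current is not None:
            yield current
            current = current.cause


def render_status(err: LayeredError) -> str:
    """Build one status string with every layer's message and the first hint found."""
    lines = [f"{e.layer}: {e.message}" for e in err.chain()]
    hint = next((e.hint for e in err.chain() if e.hint), None)
    if hint:
        lines.append(f"suggested action: {hint}")
    return "\n".join(lines)
```

A daemon-level error wrapped by the pool and then by the operator would render as one line per layer followed by the suggested action, instead of only the outermost generic message.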

A side objective is observability, which includes reporting the following all the way up to the operator status:

The status of all pools
The current update/upgrade status of each pool
What the update/upgrade is blocked on
How to unblock the upgrade
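
As one possible shape for such a report, the following Python sketch renders a one-line summary per pool, including what it is blocked on and how to unblock it. The `PoolStatus` fields and the `summarize_pools` function are hypothetical names chosen for this example and are not part of any existing MCO API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoolStatus:
    # Hypothetical per-pool summary; field names are illustrative only.
    name: str
    total_nodes: int
    updated_nodes: int
    blocked_on: Optional[str] = None     # what the update is waiting for
    unblock_hint: Optional[str] = None   # how an admin could unblock it


def summarize_pools(pools: List[PoolStatus]) -> List[str]:
    """Return one human-readable status line per pool."""
    lines = []
    for pool in pools:
        line = f"{pool.name}: {pool.updated_nodes}/{pool.total_nodes} nodes updated"
        if pool.blocked_on:
            line += f"; blocked on: {pool.blocked_on}"
            # A hint is only meaningful when something is actually blocking.
            if pool.unblock_hint:
                line += f"; to unblock: {pool.unblock_hint}"
        lines.append(line)
    return lines
```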

Approaches can include:

Better error messaging starting with common error cases
Disambiguating config mismatch
Capturing rpm-ostree logs from the previous boot in case of osimageurl mismatch errors
Propagating the full daemon error message back to the pool/operator status
Adding a new field to the MCO operator spec that suggests fixes or where to look next when an error occurs
Adding better alerting messages for MCO errors

is related to

RFE-4164 Ability to alert and intervene before a possible OCP downgrade

Backlog

links to

openshift/machine-config-operator#4771: MCO-1449: Add MCDPivotError runbook to prometheus rules

openshift/runbooks#226: MCO-1449: add runbook for MCDPivotError

openshift/runbooks#230: MCO-1492: Add runbook SystemMemoryExceedsReservation

There are no Sub-Tasks for this issue.

Assignee: Team MCO

Reporter: Michelle Krejci

QA Contact: Rio Liu

Votes: 0

Watchers: 8

Created: 2021/11/04 9:24 PM

Updated: 2025/08/07 7:20 PM
