Epic
Resolution: Unresolved
Major
Actionable Error Messaging
To Do
OCPSTRAT-554 - Improving error handling, propagation, collection, and disambiguation for users
Impediment
58% To Do, 5% In Progress, 37% Done
Error propagation in the MCO is generally not 1-to-1. The operator status usually captures the pool status, but the full error from the Controller/Daemon does not fully bubble up to the pool/operator, and journal logs containing the error generally do not get bubbled up at all. This is very confusing for customers/admins who work with the MCO without a full understanding of its internal mechanics:
- The real error is hard to find
- The error message is often generic and ambiguous
- The solution/workaround is not clear at all
Using “unexpected on-disk state” as an example, this can be caused by any number of the following:
- An incomplete update happened, and something rebooted the node
- The node upgrade was successful until rpm-ostree, which failed and atomically rolled back
- The user modified something manually
- Another operator modified something manually
- Some other service/network manager overwrote something MCO writes
And so on.
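As an illustration of how one generic message could be split into distinguishable cases, here is a minimal Python sketch. The `DriftCause` codes and the `describe` helper are hypothetical names invented for this example; they are not existing MCO identifiers, and how each cause would actually be detected is out of scope here.

```python
from enum import Enum


class DriftCause(Enum):
    """Hypothetical reason codes, one per cause listed above."""
    INTERRUPTED_UPDATE = "an update was interrupted by a node reboot"
    RPM_OSTREE_ROLLBACK = "rpm-ostree failed and rolled back"
    MANUAL_EDIT = "a user modified a managed file"
    OTHER_OPERATOR = "another operator modified a managed file"
    OVERWRITTEN_BY_SERVICE = "another service overwrote a managed file"
    UNKNOWN = "cause could not be determined"


def describe(cause: DriftCause, detail: str = "") -> str:
    """Render the generic message together with a specific reason code."""
    text = f"unexpected on-disk state [{cause.name}]: {cause.value}"
    return f"{text} ({detail})" if detail else text


# Example with a made-up detail string.
print(describe(DriftCause.MANUAL_EDIT, "/etc/example.conf differs"))
```

With a reason code attached, two failures that share the same headline message can still be told apart in status output and alerts.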
Since error cases are wide and varied, there are many improvements we can make for each individual error state. This epic proposes targeted improvements to error messaging and propagation specifically. The goals are:
- Disambiguating different error cases with the same message
- Adding more error catching, including journal logs and rpm-ostree errors
- Propagating full error messages further up the stack, up to the operator status in a clear manner
- Adding actionable fix/information messages alongside the error message
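The propagation goal can be sketched as an error record that keeps the original message and an actionable hint intact while each layer adds only its own name. This is an assumption-laden Python illustration: `PropagatedError`, its fields, and the layer names are hypothetical and do not reflect the MCO's actual types.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PropagatedError:
    """Hypothetical error record passed from the daemon up to the operator."""
    message: str                 # full original error text, never truncated
    source: str                  # component that first observed the error
    hint: Optional[str] = None   # actionable fix or where to look next
    path: List[str] = field(default_factory=list)  # layers it passed through

    def bubble_up(self, layer: str) -> "PropagatedError":
        """Return a copy recorded as having passed through `layer`."""
        return PropagatedError(self.message, self.source, self.hint,
                               self.path + [layer])

    def render(self) -> str:
        """Format as '[top <- ... <- source] message | suggested action: hint'."""
        chain = " <- ".join(reversed([self.source] + self.path))
        text = f"[{chain}] {self.message}"
        return f"{text} | suggested action: {self.hint}" if self.hint else text


# Made-up message and hint, for illustration only.
err = PropagatedError("example failure text", "daemon", "inspect the node journal")
print(err.bubble_up("pool").bubble_up("operator").render())
```

The design choice here is that upper layers never rewrite the message; they only extend the path, so the text the user sees at the operator level is the same text the daemon produced.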
A secondary objective is observability: reporting items such as the following all the way up to the operator status:
- Reporting the status of all pools
- Pointing out current status of update/upgrade per pool
- What the update/upgrade is blocking on
- How to unblock the upgrade
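A per-pool summary covering these four items might look like the following Python sketch. `PoolReport`, `summarize`, and the sample values are hypothetical and only show the shape of the information, not an existing status format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoolReport:
    """Hypothetical per-pool status entry."""
    name: str
    updated: int                        # nodes already on the target config
    total: int                          # nodes in the pool
    blocked_on: Optional[str] = None    # what the update is waiting on
    unblock_hint: Optional[str] = None  # how to unblock it


def summarize(pools: List[PoolReport]) -> str:
    """Build one status line per pool, adding blocker details when present."""
    lines = []
    for p in pools:
        line = f"{p.name}: {p.updated}/{p.total} nodes updated"
        if p.blocked_on:
            line += f"; blocked on: {p.blocked_on}"
            if p.unblock_hint:
                line += f"; to unblock: {p.unblock_hint}"
        lines.append(line)
    return "\n".join(lines)


# Sample values are invented for the example.
print(summarize([
    PoolReport("master", 3, 3),
    PoolReport("worker", 1, 4, "node-b failed to drain", "check pod disruption budgets"),
]))
```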
Approaches can include:
- Better error messaging starting with common error cases
- Disambiguating config mismatch
- Capturing rpm-ostree logs from previous boot, in case of osimageurl mismatch errors
- Capturing full daemon error message back to pool/operator status
- Adding a new field to the MCO operator spec that attempts to suggest fixes or where to look next when an error occurs
- Adding better alerting messages for MCO errors
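For the approach of capturing rpm-ostree logs from the previous boot, one possible starting point is to read the prior boot's journal for the rpm-ostree daemon unit. The Python sketch below assumes the unit is named `rpm-ostreed` and uses standard `journalctl` options (`--boot=-1`, `--unit`, `--lines`, `--no-pager`); the function names and the 200-line default are hypothetical, and this is not how the MCO currently collects logs.

```python
import subprocess
from typing import Callable, List


def previous_boot_log_cmd(unit: str = "rpm-ostreed", max_lines: int = 200) -> List[str]:
    """Build a journalctl command for one unit's logs from the previous boot."""
    return [
        "journalctl",
        "--boot=-1",            # -1 selects the boot before the current one
        f"--unit={unit}",       # assumed unit name for the rpm-ostree daemon
        f"--lines={max_lines}",
        "--no-pager",
    ]


def capture_previous_boot_log(unit: str = "rpm-ostreed", max_lines: int = 200,
                              run: Callable = subprocess.run) -> str:
    """Run the command and return its output, or a readable note on failure."""
    result = run(previous_boot_log_cmd(unit, max_lines),
                 capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return (f"(could not read previous-boot journal for {unit}: "
                f"{result.stderr.strip()})")
    return result.stdout
```

The `run` parameter is injectable so the capture step can be exercised without a real journal; the returned text could then be attached to the error that is propagated to the pool/operator status.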
- is related to: RFE-4164 Ability to alert and intervene before a possible OCP downgrade (Backlog)
1. Docs Tracker | Closed | Unassigned
2. PX Tracker | Closed | Unassigned
3. QE Tracker | Closed | Rio Liu
4. TE Tracker | Closed | Unassigned