  1. OpenShift Container Platform (OCP) Strategy
  2. OCPSTRAT-554

Improving error handling, propagation, collection, and disambiguation for users


Details

    • Feature
    • Resolution: Unresolved
    • Normal
    • OS
    • OCPPLAN-6665 Observability Experience

    Description

      To be broken into one feature epic and a spike:

      • feature: error type disambiguation and error propagation into operator status
      • spike: general improvement on making errors more actionable for the end user

       

      The MCO today surfaces errors at multiple layers. Generally speaking, an error message can appear in four locations, from highest to lowest:

      1. The MCO operator status
      2. The MCPool status
      3. The MCController/Daemon pod logs
      4. The journal logs on the node

       

      Error propagation is generally not 1-to-1. The operator status generally captures the pool status, but the full error from the Controller/Daemon does not bubble up to the pool or operator, and errors in the journal logs generally do not bubble up at all. This is very confusing for customers and admins who work with the MCO without a full understanding of its internal mechanics:

      1. The real error is hard to find
      2. The error message is often generic and ambiguous
      3. The solution/workaround is not clear at all
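The propagation gap described above can be illustrated with a minimal sketch. It assumes each layer wraps the error from the layer below instead of replacing it, so the top-level status can still show the root cause. The class name, layer names, and messages are hypothetical and chosen for illustration only; this is not how the MCO is implemented.

```python
class LayerError(Exception):
    """An error tagged with the (hypothetical) layer that raised it."""

    def __init__(self, layer: str, message: str):
        super().__init__(message)
        self.layer = layer


def error_chain(exc: BaseException) -> list:
    """Walk the explicit cause chain from the outermost error to the root cause."""
    chain = []
    current = exc
    while current is not None:
        layer = getattr(current, "layer", "unknown")
        chain.append(f"{layer}: {current}")
        current = current.__cause__
    return chain


def node_step():
    # Lowest layer: made-up text standing in for what the journal would hold.
    raise LayerError("journal", "rpm-ostree: deployment failed, rolled back")


def daemon_step():
    try:
        node_step()
    except LayerError as err:
        # `raise ... from err` keeps the lower-level error attached as __cause__.
        raise LayerError("daemon", "unexpected on-disk state") from err


def pool_step():
    try:
        daemon_step()
    except LayerError as err:
        raise LayerError("pool", "1 node degraded") from err


def operator_status() -> str:
    """Render the whole chain instead of only the outermost message."""
    try:
        pool_step()
    except LayerError as err:
        return " <- ".join(error_chain(err))
    return "ok"
```

In this sketch the operator-level message still names the journal-level failure, which is the behavior the ticket asks for.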

       

      Using “unexpected on-disk state” as an example, this can be caused by any number of the following:

      1. An incomplete update happened, and something rebooted the node
      2. The node upgrade was successful until rpm-ostree, which failed and atomically rolled back
      3. The user modified something manually
      4. Another operator modified something manually
      5. Some other service/network manager overwrote something MCO writes

      And so on.
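One way to picture the disambiguation is a small classifier that maps observed evidence to a distinct cause instead of a single generic message. This is only a sketch: the `Evidence` fields and `Cause` names are assumptions made for illustration, and the sketch cannot tell causes 3 to 5 apart (user, another operator, another service), so it groups them as one external-write case.

```python
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    INTERRUPTED_UPDATE = "update interrupted by a reboot"
    OSTREE_ROLLBACK = "rpm-ostree failed and rolled back"
    EXTERNAL_WRITE = "file changed outside the MCO"
    UNKNOWN = "cause not determined"


@dataclass
class Evidence:
    """Hypothetical signals a node-level check could collect."""
    update_in_progress_marker: bool = False
    ostree_error_in_previous_boot: bool = False
    changed_files: tuple = ()


def classify(evidence: Evidence) -> Cause:
    # Check the most specific signal first: an rpm-ostree error explains
    # the mismatch even if an update marker is also present.
    if evidence.ostree_error_in_previous_boot:
        return Cause.OSTREE_ROLLBACK
    if evidence.update_in_progress_marker:
        return Cause.INTERRUPTED_UPDATE
    if evidence.changed_files:
        return Cause.EXTERNAL_WRITE
    return Cause.UNKNOWN


def describe(evidence: Evidence) -> str:
    """Keep the familiar message but append the disambiguated cause."""
    return f"unexpected on-disk state: {classify(evidence).value}"
```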

       

      Since error use cases are wide and varied, there are many improvements we can make for each individual error state. This epic proposes targeted improvements to error messaging and propagation specifically. The goals are:

       

      1. Disambiguating different error cases with the same message
      2. Adding more error catching, including journal logs and rpm-ostree errors
      3. Propagating full error messages further up the stack, up to the operator status in a clear manner
      4. Adding actionable fix/information messages alongside the error message
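As a rough illustration of goals 3 and 4, a status message could carry the short summary, the full lower-level detail, and a suggested next step together. The `ActionableError` type, its field names, and the example text below are hypothetical; they do not reflect an existing MCO structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActionableError:
    summary: str  # short, stable message shown at every layer
    detail: str   # full error text from the layer where it occurred
    hint: str     # what the admin could check or do next

    def render(self) -> str:
        """Format the error as one line suitable for a status condition."""
        return f"{self.summary}: {self.detail}. Suggested next step: {self.hint}"


example = ActionableError(
    summary="unexpected on-disk state",
    detail="content mismatch for file /etc/example.conf",
    hint="check whether another service manages this file",
)
```

Keeping the three parts as separate fields means each layer can pass the record upward unchanged and only the final consumer decides how to format it.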

       

      A side objective is observability, including reporting items such as the following all the way up to the operator status:

      1. Reporting the status of all pools
      2. Pointing out the current status of the update/upgrade per pool
      3. What the update/upgrade is blocking on
      4. How to unblock the upgrade
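The four reporting items above could be combined into one summary line per pool. The sketch below assumes a hypothetical `PoolStatus` record; the field names and example values are illustrative and not taken from the MachineConfigPool API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PoolStatus:
    name: str
    updated: int                        # nodes already on the target config
    total: int                          # nodes in the pool
    blocked_on: Optional[str] = None    # what the update is waiting for
    unblock_hint: Optional[str] = None  # how an admin could unblock it


def summarize(pools: List[PoolStatus]) -> List[str]:
    """Build one human-readable line per pool for the operator status."""
    lines = []
    for pool in pools:
        line = f"{pool.name}: {pool.updated}/{pool.total} nodes updated"
        if pool.blocked_on:
            line += f"; blocked on: {pool.blocked_on}"
        if pool.unblock_hint:
            line += f"; to unblock: {pool.unblock_hint}"
        lines.append(line)
    return lines
```

For example, a healthy pool renders as `master: 3/3 nodes updated`, while a blocked pool also lists the blocker and the suggested action.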

      Approaches can include:

      1. Better error messaging starting with common error cases
      2. Disambiguating config mismatch
      3. Capturing rpm-ostree logs from previous boot, in case of osimageurl mismatch errors
      4. Capturing full daemon error message back to pool/operator status
      5. Adding a new field to the MCO operator spec that attempts to suggest fixes or where to look next when an error occurs
      6. Adding better alerting messages for MCO errors
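For approach 3, one possible way to read logs from the previous boot is `journalctl` with its `--boot=-1` and `--unit` options. The sketch below only builds and runs that command; the unit name `rpm-ostreed.service` is an assumption about where the relevant messages are logged, and the helper names are hypothetical.

```python
import subprocess
from typing import List


def previous_boot_command(unit: str = "rpm-ostreed.service") -> List[str]:
    """Build a journalctl command for one unit, limited to the previous boot."""
    return [
        "journalctl",
        "--boot=-1",       # -1 selects the boot before the current one
        f"--unit={unit}",  # assumed unit name; adjust to the real source
        "--no-pager",
    ]


def capture_previous_boot_log(unit: str = "rpm-ostreed.service") -> str:
    """Run the command on the node and return its output as text."""
    result = subprocess.run(
        previous_boot_command(unit),
        capture_output=True,
        text=True,
        check=False,  # a missing previous boot should not raise here
    )
    return result.stdout
```

The captured text could then be attached to the daemon error so that an osimageurl mismatch reports what rpm-ostree logged before the reboot.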


            People

              rhn-support-mrussell Mark Russell
              jerzhang@redhat.com Yu Qi Zhang
              Cheng Zhang Cheng Zhang
              Matthew Werner Matthew Werner
              Derrick Ornelas Derrick Ornelas