
OCPBUGS-5520: MCDPivotError alert fires due to transient failures


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Affects Version: 4.13

      Description of problem:

      This bug is the result of the analysis in Jira TRT-735. In all the cases analyzed, the failures were transient, but the MCDPivotError alert was latched for 15m and resulted in test failures.

      This search will return all the jobs that have this alert firing: https://search.ci.openshift.org/?search=MCDPivotError.*firing&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

      Here is a link to the Slack discussion: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1672774860494109

      A pivot error is typically caused by networking issues that rpm-ostree encounters when performing a transaction rebase. Connections to the registry can fail for different reasons, but the end result is that the mcd_pivot_errors_total metric is incremented whenever such an error occurs. Based on the alert definition here: https://github.com/openshift/machine-config-operator/blob/bab235d09cc3b9e6cf7a9b9149817fdb1c5e3649/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L76, we fire the alert whenever such an error occurs, and it lasts 15m. Yet in most of the cases we analyzed, these errors were transient and a retry (within seconds) corrected the problem.
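
      For reference, the rule behind this alert has roughly the following shape. This is a sketch only; the exact expression, "for" duration, labels, and annotations are in the linked prometheus-rules.yaml. The key assumption here is that mcd_pivot_errors_total is a plain counter that is not reset after a successful retry, so a single transient failure keeps the expression true:

        # Rough sketch of the existing MCDPivotError rule; see the linked file
        # for the authoritative expression, duration, and annotations.
        - alert: MCDPivotError
          expr: mcd_pivot_errors_total > 0   # stays true once any error has been
                                             # counted, assuming the counter is
                                             # never reset on a successful retry
          for: 2m                            # illustrative value only
          labels:
            severity: warning                # placeholder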

       

      Here are a few questions:

      1. If we expect transient errors like this, and a follow-up retry corrects the issue within a minute, should we wait for some time (a minute?) before firing this alert?
      2. Depending on the retry logic, we might need to revise the alert definition. For example, if we expect a constant retry interval (within a minute), we can keep the same expression and just lower the latch from 15m to something much smaller; since we retry at least once within the last minute, this value is guaranteed to keep incrementing in a genuinely errored condition. (A sketch of one possible revision follows this list.)
      3. However, if we are using an exponentially increasing retry interval, we will need something else to actually trigger the alert. @wking has some suggestions in the Slack thread that might work in this case, but that means we would need to add more metrics to achieve the goal.
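
      As a sketch of the direction in question 2 (not a concrete proposal; the window, "for" duration, and severity below are placeholders), the alert could key off recent increases of the counter rather than its absolute value, so it only keeps firing while errors are still being recorded:

        # Hypothetical revision: alert on recent increments instead of the raw
        # counter value. The 5m window and 10m "for" duration are placeholders.
        - alert: MCDPivotError
          expr: increase(mcd_pivot_errors_total[5m]) > 0
          for: 10m
          labels:
            severity: warning                # placeholder

      With a retry at least once per minute, a genuine failure keeps incrementing the counter, so the increase() over the window stays positive and the alert eventually fires; a one-off transient failure ages out of the window and the alert clears on its own. An exponentially backing-off retry (question 3) would outgrow any fixed window, which is why that case likely needs additional metrics, as discussed in the Slack thread.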

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

       

            djoshy David Joshy
            kenzhang@redhat.com Ken Zhang
            Rio Liu Rio Liu