-
Bug
-
Resolution: Done
-
Major
-
None
-
4.13
-
+
-
Important
-
None
-
MCO Sprint 231
-
1
-
Rejected
-
False
-
-
N/A
-
Bug Fix
-
Done
Description of problem:
This bug is a result of the analysis of Jira issue TRT-735. In all the cases analyzed, the failures were transient, but the MCDPivotError alert stayed latched for 15m and caused test failures.
This search lists all the jobs in which the alert fired: https://search.ci.openshift.org/?search=MCDPivotError.*firing&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
Here is a link to the Slack discussion: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1672774860494109
Typically, a pivot error is caused by networking issues that rpm-ostree encounters when performing a txn rebase. Connections to the registry can fail for different reasons, but the end result is that the mcd_pivot_errors_total metric is incremented whenever such an error occurs. Based on the alert definition here: https://github.com/openshift/machine-config-operator/blob/bab235d09cc3b9e6cf7a9b9149817fdb1c5e3649/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L76, we fire the alert whenever such an error occurs, and it lasts 15m. Yet in most of the cases we analyzed, these errors were transient and a retry (within seconds) corrected the problem.
Here are a few questions:
- If we expect transient errors like this, and a follow-up retry corrects the issue within a minute, should we wait for some time (a minute?) before firing this alert?
- Depending on the retry logic, we might need to revise the alert definition. For example, if we expect a constant retry interval (within a minute), we can keep the same definition and just lower the latch from 15m to something much smaller. Since we would retry at least once within any given minute, the counter is guaranteed to keep incrementing under a real error condition.
- Yet if we are using an exponentially increasing retry interval, we will need something else to trigger the alert reliably. @wking has some suggestions in the Slack thread that might work in this case, but they would require adding more metrics.
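To make the last two questions concrete, here is a rough Python sketch (a toy model, not PromQL and not the actual MCD retry logic). It assumes the alert condition is "the error counter incremented at least once within the last `window` seconds". The retry intervals, the 60s window, and all function names are illustrative assumptions, not values taken from the machine-config-operator.

```python
from itertools import accumulate


def failure_times(intervals):
    """Seconds at which the error counter increments.

    Assumes the first failure happens at t=0 and every retry also fails,
    i.e. a persistent (non-transient) error condition.
    """
    return [0] + list(accumulate(intervals))


def alert_active(times, t, window):
    """True if at least one counter increment fell in (t - window, t]."""
    return any(t - window < ft <= t for ft in times)


def inactive_seconds(times, window, horizon):
    """Seconds in [0, horizon) at which the alert condition is false."""
    return [t for t in range(horizon) if not alert_active(times, t, window)]


WINDOW = 60  # hypothetical short latch, replacing the 15m one

# Hypothetical constant retry every 30s.
constant = failure_times([30] * 20)
# Hypothetical exponential backoff: 30s, 60s, 120s, 240s.
backoff = failure_times([30, 60, 120, 240])

print("constant retry, seconds with alert off:",
      len(inactive_seconds(constant, WINDOW, 450)))
print("exponential backoff, seconds with alert off:",
      len(inactive_seconds(backoff, WINDOW, 450)))
```

Under these assumptions, the constant-interval case keeps the condition true for the whole duration of a persistent failure, while the backoff case leaves gaps once the retry interval grows past the window, even though the node is still failing. That is the reason a short latch alone would not be enough with exponential backoff.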
Version-Release number of selected component (if applicable):
How reproducible:
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
Additional info:
- impacts account
-
TRT-735 Investigate MCDPivotError alerts firing
- Closed