
OCPBUGS-5520: MCDPivotError alert fires due to transient failures


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Affects Version: 4.13

      Description of problem:

      This bug is the result of the analysis in Jira TRT-735. In all the cases analyzed, the failures were transient, but the MCDPivotError alert was latched for 15m and resulted in test failures.

      This search will return all the jobs that have this alert firing: https://search.ci.openshift.org/?search=MCDPivotError.*firing&maxAge=168h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

      Here is a link to the Slack discussion: https://redhat-internal.slack.com/archives/C02CZNQHGN8/p1672774860494109

      A pivot error is typically caused by networking issues that rpm-ostree encounters when performing a transaction rebase. Connections to the registry can fail for different reasons, but the end result is that the mcd_pivot_errors_total metric is incremented whenever such an error occurs. Based on the alert definition here: https://github.com/openshift/machine-config-operator/blob/bab235d09cc3b9e6cf7a9b9149817fdb1c5e3649/install/0000_90_machine-config-operator_01_prometheus-rules.yaml#L76, we fire the alert whenever such an error occurs, and it lasts 15m. Yet in most of the cases we analyzed, these errors were transient and a retry (within seconds) corrected the problem.
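
      For reference, the rule behind this alert has roughly the following shape. This is a sketch only; the exact expression, "for" duration, labels, and annotations are in the linked prometheus-rules.yaml. The key assumption here is that mcd_pivot_errors_total is a plain counter that is not reset after a successful retry, so a single transient failure keeps the expression true:

        # Rough sketch of the existing MCDPivotError rule; see the linked file
        # for the authoritative expression, duration, and annotations.
        - alert: MCDPivotError
          expr: mcd_pivot_errors_total > 0   # stays true once any error has been
                                             # counted, assuming the counter is
                                             # never reset on a successful retry
          for: 2m                            # illustrative value only
          labels:
            severity: warning                # placeholder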

       

      Here are a few questions:

      1. If we expect transient errors like this, and a follow-up retry corrects the issue within a minute, should we wait for some time (a minute?) before firing this alert?
      2. Depending on the retry logic, we might need to revise the alert definition. For example, if we expect a constant retry interval (within a minute), we can keep the same expression and just lower the latch from 15m to something much smaller; since we retry at least once within the last minute, this value is guaranteed to keep incrementing in a genuinely errored condition. (A sketch of one possible revision follows this list.)
      3. However, if we are using an exponentially increasing retry interval, we will need something else to actually trigger the alert. @wking has some suggestions in the Slack thread that might work in this case, but that means we would need to add more metrics to achieve the goal.
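
      As a sketch of the direction in question 2 (not a concrete proposal; the window, "for" duration, and severity below are placeholders), the alert could key off recent increases of the counter rather than its absolute value, so it only keeps firing while errors are still being recorded:

        # Hypothetical revision: alert on recent increments instead of the raw
        # counter value. The 5m window and 10m "for" duration are placeholders.
        - alert: MCDPivotError
          expr: increase(mcd_pivot_errors_total[5m]) > 0
          for: 10m
          labels:
            severity: warning                # placeholder

      With a retry at least once per minute, a genuine failure keeps incrementing the counter, so the increase() over the window stays positive and the alert eventually fires; a one-off transient failure ages out of the window and the alert clears on its own. An exponentially backing-off retry (question 3) would outgrow any fixed window, which is why that case likely needs additional metrics, as discussed in the Slack thread.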

      Version-Release number of selected component (if applicable):

       

      How reproducible:

       

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

       

      Additional info:

       

            djoshy David Joshy
            kenzhang@redhat.com Ken Zhang
            Rio Liu Rio Liu