Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: 4.21.0
Affects Version/s: 4.20.z, 4.21.0
Component/s: Monitoring
Labels:
- component-regression

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
Moderate
Regression:
None

Target Backport Versions:

4.20.z
Target Version:

4.21.0
Release Blocker:
Approved
Sprint:
MON Sprint 278, MON Sprint 279
sprint_count:
2

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

(Feel free to update this bug's summary to be more specific.)
Component Readiness has found a potential regression in the following test:

[sig-arch] events should not repeat pathologically for ns/openshift-monitoring

Significant regression detected.
Fisher's exact test probability of a regression: 99.99%.
Test pass rate dropped from 100.00% to 92.31%.

Sample (being evaluated) Release: 4.20
Start Time: 2025-09-26T00:00:00Z
End Time: 2025-10-03T08:00:00Z
Success Rate: 92.31%
Successes: 36
Failures: 3
Flakes: 0
Base (historical) Release: 4.18
Start Time: 2025-01-26T00:00:00Z
End Time: 2025-02-25T00:00:00Z
Success Rate: 100.00%
Successes: 145
Failures: 0
Flakes: 0
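
For readers who want to sanity-check the numbers above, the following is a minimal sketch of a one-sided Fisher's exact test on the sample and base counts. It assumes a plain hypergeometric formulation. The function names `pass_rate` and `fisher_one_sided` are illustrative and are not taken from Component Readiness, whose own calculation may differ, so this sketch is not expected to reproduce the reported 99.99% exactly.

```python
from math import comb


def pass_rate(successes, failures):
    """Pass rate as a percentage, rounded to two decimals."""
    return round(100.0 * successes / (successes + failures), 2)


def fisher_one_sided(sample_fail, sample_pass, base_fail, base_pass):
    """One-sided Fisher's exact p-value.

    Probability of seeing at least `sample_fail` failures in the sample
    if failures were spread at random across sample and base runs
    (hypergeometric distribution with fixed margins).
    """
    n_sample = sample_fail + sample_pass
    n_total = n_sample + base_fail + base_pass
    total_fail = sample_fail + base_fail
    upper = min(total_fail, n_sample)
    favourable = sum(
        comb(total_fail, k) * comb(n_total - total_fail, n_sample - k)
        for k in range(sample_fail, upper + 1)
    )
    return favourable / comb(n_total, n_sample)


# Counts copied from the report above.
sample_rate = pass_rate(36, 3)    # 4.20 sample
base_rate = pass_rate(145, 0)     # 4.18 base
p_value = fisher_one_sided(3, 36, 0, 145)
print(sample_rate, base_rate, p_value)
```

With these counts the one-sided p-value is small (a little under 0.01), which agrees with the report's conclusion that the drop is unlikely to be chance, even if the exact confidence figure is computed differently by the tool.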

View the test details report for additional context.

The failure also happens in other configurations, but it is quite rare overall, so it has not drawn much attention until now. It surfaced today in this specific metal report because it happened to reach the minimum of 3 failures.

Error message is:

[sig-arch] events should not repeat pathologically for ns/openshift-monitoring
{  1 events happened too frequently

event happened 25 times, something is wrong: namespace/openshift-monitoring node/worker-0 pod/prometheus-k8s-0 hmsg/357171899f - reason/Unhealthy Readiness probe errored: rpc error: code = Unknown desc = command error: cannot register an exec PID: container is stopping, stdout: , stderr: , exit code -1 (12:24:14Z) result=reject }
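
When triaging many such failures, it can help to pull the fields out of the message programmatically. The sketch below parses a line in the format shown above and checks it against a hypothetical exception rule. The regular expression is inferred from this single message, and `is_shutdown_probe_noise` is an illustrative assumption; it is not the logic of the origin test or of the linked pull request.

```python
import re

# Pattern inferred from the one message in this report (an assumption).
EVENT_RE = re.compile(
    r"event happened (?P<count>\d+) times, something is wrong: "
    r"namespace/(?P<namespace>\S+) node/(?P<node>\S+) pod/(?P<pod>\S+) "
    r"hmsg/(?P<hmsg>\S+) - reason/(?P<reason>\S+) (?P<message>.*)"
)

SAMPLE = (
    "event happened 25 times, something is wrong: "
    "namespace/openshift-monitoring node/worker-0 pod/prometheus-k8s-0 "
    "hmsg/357171899f - reason/Unhealthy Readiness probe errored: rpc error: "
    "code = Unknown desc = command error: cannot register an exec PID: "
    "container is stopping, stdout: , stderr: , exit code -1 (12:24:14Z) "
    "result=reject }"
)


def parse_event(line):
    """Return the event fields as a dict, or None if the line does not match."""
    match = EVENT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["count"] = int(fields["count"])
    return fields


def is_shutdown_probe_noise(event):
    """Hypothetical rule: readiness probe errors from a stopping Prometheus pod."""
    return (
        event["namespace"] == "openshift-monitoring"
        and event["pod"].startswith("prometheus-k8s-")
        and event["reason"] == "Unhealthy"
        and "container is stopping" in event["message"]
    )


event = parse_event(SAMPLE)
print(event["count"], event["pod"], is_shutdown_probe_noise(event))
```

A rule of this kind is deliberately narrow: it matches only on namespace, pod name prefix, reason, and the "container is stopping" text, so unrelated readiness failures in the same namespace would still be reported.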

It appears to happen just after the monitoring operator upgrades; see this chart.

Note that this test is intended to protect the API server.

Global test analysis can be used to find failures across all jobs, and Search CI can show these specific failures over the past two days. The failure is quite common globally.

Filed by: dgoodwin@redhat.com

links to

openshift/origin#30372: OCPBUGS-62703: Relax duplicate events detection for Prometheus

Assignee: Pranshu Srivastava

Reporter: OpenShift Technical Release Team

Need Info From: None

Contributors: None

QA Contact: Junqi Zhao

Doc Contact: None

Votes: 0

Watchers: 7

Created: 2025/10/03 11:37 AM

Updated: 2025/11/13 2:54 PM
