OpenShift Bugs / OCPBUGS-60017

GCP Upgrades Failing Due to Monitoring Operator Degraded


    • Quality / Stability / Reliability
    • MON Sprint 274
    • In Progress
    • Bug Fix
      Before this release, in multi-zone clusters with a single worker per zone, the Monitoring Operator could become degraded if the two nodes running its Prometheus pods rebooted sequentially and each took longer than 15 minutes to recover. With this release, the timeout is extended to 20 minutes, reducing the likelihood of the Monitoring Operator entering a degraded state on common cluster topologies. (link:https://issues.redhat.com/browse/OCPBUGS-60017[OCPBUGS-60017])
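
      The change is essentially a longer tolerance window: the operator waits up to 20 minutes (previously 15) for its Prometheus pods to come back before reporting Degraded. The Go sketch below illustrates that pattern with client-go; it is not the Cluster Monitoring Operator's actual code, and the polling interval and resource lookup are assumptions.

      // Illustrative sketch only (not CMO code): wait up to 20 minutes,
      // previously 15, for the prometheus-k8s StatefulSet to report all
      // replicas ready before treating monitoring as Degraded.
      package main

      import (
          "context"
          "fmt"
          "time"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/util/wait"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/rest"
      )

      const prometheusReadyTimeout = 20 * time.Minute // extended from 15m

      func waitForPrometheusReady(ctx context.Context, client kubernetes.Interface) error {
          return wait.PollUntilContextTimeout(ctx, 15*time.Second, prometheusReadyTimeout, true,
              func(ctx context.Context) (bool, error) {
                  sts, err := client.AppsV1().StatefulSets("openshift-monitoring").
                      Get(ctx, "prometheus-k8s", metav1.GetOptions{})
                  if err != nil {
                      return false, nil // transient API error: keep polling
                  }
                  desired := int32(2)
                  if sts.Spec.Replicas != nil {
                      desired = *sts.Spec.Replicas
                  }
                  return sts.Status.ReadyReplicas == desired, nil
              })
      }

      func main() {
          cfg, err := rest.InClusterConfig()
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)
          if err := waitForPrometheusReady(context.Background(), client); err != nil {
              fmt.Println("monitoring would report Degraded:", err)
          }
      }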
    • None
    • None
    • None
    • None

      This is a clone of issue OCPBUGS-59962. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-59932. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-57215. The following is the description of the original issue:

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]

      Significant regression detected.
      Fisher's Exact probability of a regression: 100.00%.
      Test pass rate dropped from 100.00% to 87.20%.
      Regression is triaged and believed fixed as of 2025-06-06T16:00:00Z.

      Sample (being evaluated) Release: 4.19
      Start Time: 2025-06-02T00:00:00Z
      End Time: 2025-06-09T12:00:00Z
      Success Rate: 87.20%
      Successes: 218
      Failures: 32
      Flakes: 0

      Base (historical) Release: 4.15
      Start Time: 2024-01-29T00:00:00Z
      End Time: 2024-02-28T00:00:00Z
      Success Rate: 100.00%
      Successes: 625
      Failures: 0
      Flakes: 0
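
      The 100.00% figure follows from those counts: zero failures in 625 base runs versus 32 failures in 250 sample runs. The sketch below shows one way to reproduce a number like that with a one-sided Fisher's exact (hypergeometric) test; the exact statistic Component Readiness computes may differ.

      // Rough sketch: one-sided Fisher's exact test on the 2x2 table of
      // pass/fail counts above (base 625/0 vs. sample 218/32).
      package main

      import (
          "fmt"
          "math"
      )

      // lnChoose returns ln(n choose k) via the log-gamma function.
      func lnChoose(n, k float64) float64 {
          a, _ := math.Lgamma(n + 1)
          b, _ := math.Lgamma(k + 1)
          c, _ := math.Lgamma(n - k + 1)
          return a - b - c
      }

      // upperTail returns P(X >= sampleFail) for a hypergeometric X: the chance
      // of seeing at least this many sample failures if pass rates were equal.
      func upperTail(basePass, baseFail, samplePass, sampleFail int) float64 {
          N := float64(basePass + baseFail + samplePass + sampleFail) // all runs
          K := float64(baseFail + sampleFail)                         // all failures
          n := float64(samplePass + sampleFail)                       // sample runs
          p := 0.0
          for k := float64(sampleFail); k <= math.Min(K, n); k++ {
              p += math.Exp(lnChoose(K, k) + lnChoose(N-K, n-k) - lnChoose(N, n))
          }
          return p
      }

      func main() {
          p := upperTail(625, 0, 218, 32)
          fmt.Printf("p-value %.3g, regression probability ~%.2f%%\n", p, (1-p)*100)
      }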

      View the test details report for additional context.

      This test unfortunately suffered a major outage late last week, but has failed an alarming number of times since with:

      {  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:200]: cluster is reporting a failing condition: Cluster operator monitoring is degraded
      Ginkgo exit error 1: exit with code 1}
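
      The failure is the late-upgrade check that no ClusterOperator reports Degraded=True. A minimal sketch of that kind of check against the monitoring ClusterOperator, using openshift/client-go, is below; it is illustrative and not the origin test's actual implementation.

      // Illustrative sketch (not origin's code): report whether the
      // "monitoring" ClusterOperator currently has Degraded=True.
      package main

      import (
          "context"
          "fmt"

          configv1 "github.com/openshift/api/config/v1"
          configclient "github.com/openshift/client-go/config/clientset/versioned"
          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          client := configclient.NewForConfigOrDie(cfg)

          co, err := client.ConfigV1().ClusterOperators().Get(context.Background(), "monitoring", metav1.GetOptions{})
          if err != nil {
              panic(err)
          }
          for _, cond := range co.Status.Conditions {
              if cond.Type == configv1.OperatorDegraded && cond.Status == configv1.ConditionTrue {
                  fmt.Printf("monitoring Degraded: %s: %s\n", cond.Reason, cond.Message)
              }
          }
      }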
      

      Sample job runs are those in the report linked above since around Jun 7th. There appear to be about 6.

      Example from yesterday:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade/1931532175954415616

      Intervals show:

      source/OperatorDegraded display/true condition/Degraded reason/UpdatingPrometheusFailed status/True UpdatingPrometheus: Prometheus "openshift-monitoring/k8s": SomePodsNotReady: shard 0: pod prometheus-k8s-1: 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling. [2m49s]
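
      This is consistent with the topology in the release note above: the pod's zonal storage pins prometheus-k8s-1 to the single worker in its zone, so while that node is cordoned and rebooted for the upgrade the remaining nodes are either in other zones (volume node affinity conflict) or control-plane nodes (untolerated taint), and the pod stays pending until its node returns. A small client-go sketch for pulling those FailedScheduling events (illustrative only):

      // Illustrative sketch: list FailedScheduling events for prometheus-k8s-1
      // to see why the scheduler could not place it during the upgrade.
      package main

      import (
          "context"
          "fmt"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          client := kubernetes.NewForConfigOrDie(cfg)

          events, err := client.CoreV1().Events("openshift-monitoring").List(context.Background(),
              metav1.ListOptions{FieldSelector: "involvedObject.name=prometheus-k8s-1,reason=FailedScheduling"})
          if err != nil {
              panic(err)
          }
          for _, ev := range events.Items {
              fmt.Printf("%s: %s\n", ev.LastTimestamp.Format("15:04:05"), ev.Message)
          }
      }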
      

              jfajersk@redhat.com Jan Fajerski
              rhn-engineering-dgoodwin Devan Goodwin
              Junqi Zhao