OpenShift Bugs / OCPBUGS-59962

GCP Upgrades Failing Due to Monitoring Operator Degraded


    • Quality / Stability / Reliability
    • MON Sprint 274
    • Done
    • Bug Fix
      Before this update, in multi-zone clusters with only a single worker node per zone, if the Monitoring Operator's Prometheus pods were scheduled to nodes that rebooted back-to-back and each node took longer than 15 minutes to return to service, the Monitoring Operator might have become degraded. With this release, the timeout has been extended to 20 minutes, which prevents the Monitoring Operator from entering a degraded state on common cluster topologies. Clusters where the two nodes hosting Prometheus pods reboot back-to-back and each take more than 20 minutes to return to service might still report a degraded state until the second node and its Prometheus pod are back to normal. (link:https://issues.redhat.com/browse/OCPBUGS-59962[OCPBUGS-59962])

      This is a clone of issue OCPBUGS-59932. The following is the description of the original issue:

      This is a clone of issue OCPBUGS-57215. The following is the description of the original issue:

      (Feel free to update this bug's summary to be more specific.)
      Component Readiness has found a potential regression in the following test:

      [sig-arch][Feature:ClusterUpgrade] Cluster should be upgradeable after finishing upgrade [Late][Suite:upgrade]

      Significant regression detected.
      Fisher's Exact probability of a regression: 100.00%.
      Test pass rate dropped from 100.00% to 87.20%.
      Regression is triaged and believed fixed as of 2025-06-06T16:00:00Z.

      Sample (being evaluated) Release: 4.19
      Start Time: 2025-06-02T00:00:00Z
      End Time: 2025-06-09T12:00:00Z
      Success Rate: 87.20%
      Successes: 218
      Failures: 32
      Flakes: 0

      Base (historical) Release: 4.15
      Start Time: 2024-01-29T00:00:00Z
      End Time: 2024-02-28T00:00:00Z
      Success Rate: 100.00%
      Successes: 625
      Failures: 0
      Flakes: 0

      View the test details report for additional context.

      This test unfortunately suffered a major outage late last week, but it has failed an alarming number of times since then with:

      {  fail [github.com/openshift/origin/test/e2e/upgrade/upgrade.go:200]: cluster is reporting a failing condition: Cluster operator monitoring is degraded
      Ginkgo exit error 1: exit with code 1}
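
      For anyone triaging one of these runs, the Degraded condition and its message can be read directly from the ClusterOperator, alongside the Prometheus pods it is waiting on. A minimal sketch using standard oc commands against the default openshift-monitoring namespace (nothing cluster-specific is assumed):

      # Show the monitoring ClusterOperator's Degraded condition, including its reason and message.
      oc get clusteroperator monitoring -o jsonpath='{.status.conditions[?(@.type=="Degraded")]}{"\n"}'

      # List the Prometheus pods, their readiness, and the nodes they are scheduled to.
      oc -n openshift-monitoring get pods -o wide | grep prometheus-k8s

      # Confirm whether those nodes have come back Ready after their reboots.
      oc get nodes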
      

      Sample job runs are those in the report linked above since around Jun 7th. There appear to be about 6.

      Example from yesterday:

      https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-ci-4.19-upgrade-from-stable-4.18-e2e-gcp-ovn-rt-upgrade/1931532175954415616

      Intervals show:

      source/OperatorDegraded display/true condition/Degraded reason/UpdatingPrometheusFailed status/True UpdatingPrometheus: Prometheus "openshift-monitoring/k8s": SomePodsNotReady: shard 0: pod prometheus-k8s-1: 0/6 nodes are available: 1 node(s) were unschedulable, 2 node(s) had volume node affinity conflict, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/6 nodes are available: 6 Preemption is not helpful for scheduling. [2m49s]
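
      The "volume node affinity conflict" above is the usual signature of zonal storage on GCP: when Prometheus runs with PVC-backed persistent storage, each replica's volume is pinned to a single zone, so with only one worker per zone the pod has nowhere else to be scheduled while that node reboots. A rough way to confirm this on a live cluster (a sketch only; <pv-name> is a placeholder, and it assumes persistent storage is configured for Prometheus):

      # Find the PVCs backing the Prometheus replicas and the PVs they are bound to.
      oc -n openshift-monitoring get pvc

      # Inspect the zone a volume is pinned to via its node affinity.
      oc get pv <pv-name> -o jsonpath='{.spec.nodeAffinity}{"\n"}'

      # Compare against the zones of the worker nodes.
      oc get nodes -L topology.kubernetes.io/zone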
      

              Jan Fajerski (jfajersk@redhat.com)
              Devan Goodwin (rhn-engineering-dgoodwin)
              Junqi Zhao