Type: Bug
Resolution: Done-Errata
Priority: Normal
Fix Version/s: 4.14.z
Affects Version/s: 4.15
Component/s: Monitoring
Labels:
None

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
Moderate
Regression:
No

Target Backport Versions:
None
Target Version:
4.14.z
Release Blocker:
None
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

This is a clone of issue ~~OCPBUGS-23745~~. The following is the description of the original issue:
—

Description of problem:

Seen in 4.15 update CI:

: [bz-Monitoring] clusteroperator/monitoring should not change condition/Available
Run #0: Failed 1h16m1s
{ 1 unexpected clusteroperator state transitions during e2e test run

Nov 21 04:20:56.837 - 19s E clusteroperator/monitoring condition/Available reason/UpdatingPrometheusK8SFailed status/False reconciling Prometheus Federate Route failed: retrieving Route object failed: etcdserver: leader changed}

While the Kube API server is supposed to buffer clients from etcd leader transitions, an issue that persists for only 19s is not long enough to warrant immediate admin intervention. Teaching the monitoring operator to stay Available=True through this kind of brief hiccup, while still going Available=False when at least part of the component is non-functional and the condition requires immediate administrator intervention, would make it easier for admins and SREs operating clusters to identify when intervention is required.

Version-Release number of selected component (if applicable):

A bunch of 4.15 jobs are impacted, almost all update jobs:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/monitoring+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort
periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 50% of failures match = 50% impact
periodic-ci-openshift-hypershift-release-4.15-periodics-e2e-kubevirt-conformance (all) - 2 runs, 50% failed, 200% of failures match = 100% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 6 runs, 100% failed, 33% of failures match = 33% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact
periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 50 runs, 56% failed, 4% of failures match = 2% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 44% failed, 9% of failures match = 4% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 30% failed, 17% of failures match = 5% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 43% failed, 38% of failures match = 16% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 52 runs, 15% failed, 175% of failures match = 27% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 5 runs, 60% failed, 33% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 5 runs, 100% failed, 40% of failures match = 40% impact
periodic-ci-openshift-release-master-nightly-4.15-e2e-ibmcloud-csi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact
periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
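
For reading the listing above: on each line, the impact figure appears to be roughly the product of the two preceding percentages (share of runs that failed, times share of failures that match). That relationship is inferred from the numbers shown here, not taken from the search tool's documentation, and because the inputs are already rounded the product can differ from the reported value by a point or so. The function name approxImpact is hypothetical.

```go
package main

import "fmt"

// approxImpact estimates the impact percentage from the two rounded
// percentages printed on a line. Assumption: impact ~= failed% * match% / 100,
// inferred from the listing rather than from the tool's documentation.
func approxImpact(failedPct, matchPct float64) float64 {
	return failedPct * matchPct / 100
}

func main() {
	// Each row: failed %, % of failures that match, reported impact %,
	// copied from three lines of the listing.
	rows := [][3]float64{
		{43, 38, 16},
		{50, 200, 100},
		{15, 175, 27},
	}
	for _, r := range rows {
		fmt.Printf("failed=%v%% match=%v%% -> approx %.2f%% (reported %v%%)\n",
			r[0], r[1], approxImpact(r[0], r[1]), r[2])
	}
}
```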

Hit rates are low enough there that I haven't checked older 4.y. I'm not sure whether all of those hits are UpdatingPrometheusK8SFailed; it seems likely that Kube API hiccups could impact a number of control loops. There may also be other triggers besides Kube API hiccups.

How reproducible:

16% impact in periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade looks like the current largest impact percentage among the jobs with double-digit run counts.

Steps to Reproduce:

Run periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade or another job with a combination of high-ish impact percentage and high run counts, watching the monitoring ClusterOperator's Available condition.

Actual results:

Blips of Available=False that resolve more quickly than a responding admin could be expected to show up.

Expected results:

Only going Available=False when it seems reasonable to summon an emergency admin response.

Additional info:

I have no problem if folks decide to push for Kube API server / etcd perfection, but that seems like a hard goal to reach reliably in the mess of the real world, so even if you do push those folks for improvements, I think it makes sense to relax your response to those kinds of issues to only complain when things like Route object retrieval failures go on for long enough for the operator to be seriously

is blocked by

OCPBUGS-23745 monitoring ClusterOperator should not blip Available=False on quick etcd leader changes

Closed

links to

openshift/cluster-monitoring-operator#2216: [release-4.14] OCPBUGS-25800: Wait for 3 (instead of 2) consecutive failing reconcil…

RHBA-2024:0642 OpenShift Container Platform 4.14.z bug fix update
