-
Bug
-
Resolution: Done-Errata
-
Normal
-
4.15
-
None
-
Moderate
-
No
-
MON Sprint 245
-
1
-
False
-
-
NA
-
Release Note Not Required
-
In Progress
Description of problem:
Seen in 4.15 update CI:
: [bz-Monitoring] clusteroperator/monitoring should not change condition/Available expand_less Run #0: Failed expand_less 1h16m1s { 1 unexpected clusteroperator state transitions during e2e test run Nov 21 04:20:56.837 - 19s E clusteroperator/monitoring condition/Available reason/UpdatingPrometheusK8SFailed status/False reconciling Prometheus Federate Route failed: retrieving Route object failed: etcdserver: leader changed}
While the Kube API server is supposed to buffer clients from etcd leader transitions, an issue that only persists for 19s is not long enough to warrant immediate admin intervention. Teaching the monitoring operator to stay Available=True for this kind of brief hiccup, while still going Available=False for issues where least part of the component is non-functional, and that the condition requires immediate administrator intervention would make it easier for admins and SREs operating clusters to identify when intervention was required.
Version-Release number of selected component (if applicable):
A bunch of 4.15 jobs are impacted, almost all update jobs:
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?maxAge=48h&type=junit&search=clusteroperator/monitoring+should+not+change+condition/Available' | grep '^periodic-.*4[.]15.*failures match' | sort periodic-ci-openshift-cluster-etcd-operator-release-4.15-periodics-e2e-aws-etcd-recovery (all) - 2 runs, 100% failed, 50% of failures match = 50% impact periodic-ci-openshift-hypershift-release-4.15-periodics-e2e-kubevirt-conformance (all) - 2 runs, 50% failed, 200% of failures match = 100% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-nightly-4.14-ocp-ovn-remote-libvirt-s390x (all) - 6 runs, 100% failed, 33% of failures match = 33% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade (all) - 5 runs, 40% failed, 50% of failures match = 20% impact periodic-ci-openshift-multiarch-master-nightly-4.15-upgrade-from-stable-4.14-ocp-e2e-upgrade-azure-ovn-heterogeneous (all) - 5 runs, 20% failed, 100% of failures match = 20% impact periodic-ci-openshift-release-master-ci-4.15-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-ci-4.15-e2e-azure-ovn-upgrade (all) - 50 runs, 56% failed, 4% of failures match = 2% impact periodic-ci-openshift-release-master-ci-4.15-e2e-azure-sdn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact periodic-ci-openshift-release-master-ci-4.15-e2e-gcp-ovn-upgrade (all) - 80 runs, 44% failed, 9% of failures match = 4% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-aws-ovn-upgrade (all) - 80 runs, 30% failed, 17% of failures match = 5% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-ovn-upgrade (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade (all) - 80 runs, 43% failed, 38% of failures match = 16% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-rt-upgrade (all) - 52 runs, 15% failed, 175% of failures match = 27% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-ovn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-gcp-sdn-upgrade (all) - 5 runs, 60% failed, 33% of failures match = 20% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-aws-ovn-single-node-serial (all) - 5 runs, 100% failed, 40% of failures match = 40% impact periodic-ci-openshift-release-master-nightly-4.15-e2e-ibmcloud-csi (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-sdn-upgrade (all) - 5 runs, 20% failed, 100% of failures match = 20% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-aws-upgrade-ovn-single-node (all) - 1 runs, 100% failed, 100% of failures match = 100% impact periodic-ci-openshift-release-master-nightly-4.15-upgrade-from-stable-4.14-e2e-metal-ipi-sdn-bm-upgrade (all) - 5 runs, 100% failed, 20% of failures match = 20% impact
Hit rates are low enough there that I haven't checked older 4.y. I'm not sure if all of those hits are UpdatingPrometheusK8SFailed or not, it seems likely that Kube API hiccups could impact a number of control loops. And there may be other triggers going on besides Kube API hiccups.
How reproducible:
16% impact in periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade looks like the current largest impact percentage among the jobs with double-digit run counts.
Steps to Reproduce:
Run periodic-ci-openshift-release-master-ci-4.15-upgrade-from-stable-4.14-e2e-azure-sdn-upgrade or another job with a combination of high-ish impact percentage and high run counts, watching the monitoring ClusterOperator's Available condition.
Actual results:
Blips of Available=False that resolve more quickly than a responding admin could be expected to show up.
Expected results:
Only going Available=False when it seems reasonable to summon an emergency admin response.
Additional info:
I have no problem if folks decide to push for Kube API server / etcd perfection, but that seems like a hard goal to reach reliably in the mess of the real world, so even if you do push those folks for improvements, I think it makes sense to relax your response to those kinds of issues to only complain when things like Route object retrieval failures go on for long enough for the operator to be seriously
- account is impacted by
-
OCPBUGS-39026 clusteroperator/monitoring blips Degraded=True during upgrade test
- ASSIGNED
- blocks
-
OCPBUGS-25800 monitoring ClusterOperator should not blip Available=False on quick etcd leader changes
- Closed
- is cloned by
-
OCPBUGS-35892 monitoring ClusterOperator should not blip Available=Unknown on client rate limiter
- Closed
- is triggering
-
OCPBUGS-25803 ConsolePluginComponents task is not resilient
- Verified
-
OCPBUGS-25849 PrometheusAdapter and MetricsServer tasks are conflict prone
- Verified
- relates to
-
OTA-362 CI: fail update suite if any ClusterOperator go Available=False
- Closed
- split to
-
MON-3587 make CMO ConsolePluginComponents and PrometheusAdapter tasks resilient to dependencies
- Closed
- links to
-
RHSA-2023:7198 OpenShift Container Platform 4.15 security update