- Bug
- Resolution: Cannot Reproduce
- Major
Job link
Snippet from the test output:
{ PrometheusOperatorWatchErrors was at or above info for at least 2m57s on platformidentification.JobType{Release:"4.11", FromRelease:"4.10", Platform:"aws", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 0s, firing for 2m57s:
Jun 06 19:49:40.986 - 59s W alert/PrometheusOperatorWatchErrors ns/openshift-monitoring ALERTS{alertname="PrometheusOperatorWatchErrors", alertstate="firing", controller="alertmanager", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="warning"}
Jun 06 19:49:40.986 - 59s W alert/PrometheusOperatorWatchErrors ns/openshift-monitoring ALERTS{alertname="PrometheusOperatorWatchErrors", alertstate="firing", controller="prometheus", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="warning"}
Jun 06 19:49:40.986 - 59s W alert/PrometheusOperatorWatchErrors ns/openshift-monitoring ALERTS{alertname="PrometheusOperatorWatchErrors", alertstate="firing", controller="thanos", namespace="openshift-monitoring", prometheus="openshift-monitoring/k8s", severity="warning"}}
As of the creation of this bug, this is happening in the aws-ovn-upgrade job in 5% of all failed jobs; it is under 3% for the sdn jobs, so it seems worth looking into. The few jobs I checked showed this alert pending and/or firing at warning level at the very beginning of the e2e tests, which is just after cluster installation has completed. Maybe initial cluster install churn is causing this, and softening the alert expression is an option? A rough sketch of what that could look like is included at the end of this report.
Link to this job's testgrid for reference.
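For reference, a minimal sketch of what "softening" could look like. The expression, threshold, and durations below are assumptions based on the usual shape of the upstream prometheus-operator watch-errors alert (a ratio of failed watch operations per controller); they are not the exact rule shipped by the cluster-monitoring-operator, and the rule name/namespace are made up for illustration:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  # Hypothetical object name; in practice this rule is managed by the
  # cluster-monitoring-operator rather than hand-edited.
  name: prometheus-operator-watch-errors-example
  namespace: openshift-monitoring
spec:
  groups:
    - name: prometheus-operator
      rules:
        - alert: PrometheusOperatorWatchErrors
          # Assumed expression: fraction of failed watch operations per controller
          # over the last 5 minutes. The shipped expression and threshold may differ.
          expr: |
            (
              sum by (controller, namespace) (
                rate(prometheus_operator_watch_operations_failed_total{job="prometheus-operator", namespace="openshift-monitoring"}[5m])
              )
              /
              sum by (controller, namespace) (
                rate(prometheus_operator_watch_operations_total{job="prometheus-operator", namespace="openshift-monitoring"}[5m])
              )
            ) > 0.4
          # "Softening" options: raise the ratio threshold above, and/or lengthen the
          # "for" duration so that a short burst of watch errors right after install
          # never reaches the firing state.
          for: 15m
          labels:
            severity: warning

Either knob would keep a brief post-install burst from firing; lengthening the "for" duration is the less invasive change, since it still surfaces sustained watch errors.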