Bug
Resolution: Done
Critical
4.20.0
Quality / Stability / Reliability
Important
Proposed
For roughly three days, 4.20 nightlies have been permafailing on this job:
[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[
  {
    "metric": {
      "__name__": "ALERTS",
      "alertname": "KubeJobFailed",
      "alertstate": "firing",
      "condition": "true",
      "container": "kube-rbac-proxy-main",
      "endpoint": "https-main",
      "job": "kube-state-metrics",
      "job_name": "sre-replace-packageserver-csv",
      "namespace": "openshift-operator-lifecycle-manager",
      "prometheus": "openshift-monitoring/k8s",
      "service": "kube-state-metrics",
      "severity": "warning"
    },
    "value": [
      1748242812.353,
      "1"
    ]
  }
]
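For reference, a rough sketch of the check this test enforces, reproduced against the cluster's Prometheus HTTP API (this is not the test's actual implementation; PROM_URL and TOKEN below are placeholders, e.g. the openshift-monitoring route and a token from `oc whoami -t`):

import requests

PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.example.com"  # placeholder route
TOKEN = "REPLACE_ME"  # placeholder bearer token, e.g. from `oc whoami -t`

# The test asserts that no alerts are firing other than Watchdog and
# AlertmanagerReceiversNotConfigured; this PromQL expression returns any offenders.
query = 'ALERTS{alertstate="firing",alertname!~"Watchdog|AlertmanagerReceiversNotConfigured"}'

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": query},
    headers={"Authorization": f"Bearer {TOKEN}"},
)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    m = series["metric"]
    print(m.get("alertname"), m.get("namespace"), series["value"])

In the failing runs, this is returning the KubeJobFailed series above for the sre-replace-packageserver-csv job.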
Possibly this is related to a late change the OLM team is trying to get into 4.19 to update their catalog versions?
Example job failure: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.20-e2e-rosa-sts-ovn/1926847779737440256
Once addressed, please check whether there's a brittle SRE job here that could be improved to prevent this in the future.
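To see why the underlying Job failed, something along these lines could pull its conditions and pod logs (a sketch using the kubernetes Python client; the namespace and Job name are taken from the alert labels above, and cluster credentials are assumed to be in the local kubeconfig):

from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig with access to the affected cluster
ns = "openshift-operator-lifecycle-manager"
job_name = "sre-replace-packageserver-csv"

# Job conditions show why kube-state-metrics reported the Job as failed.
job = client.BatchV1Api().read_namespaced_job(job_name, ns)
for cond in job.status.conditions or []:
    print(cond.type, cond.status, cond.reason, cond.message)

# Pods created by the Job carry the job-name label; dump their logs for details.
core = client.CoreV1Api()
for pod in core.list_namespaced_pod(ns, label_selector=f"job-name={job_name}").items:
    print(pod.metadata.name, pod.status.phase)
    print(core.read_namespaced_pod_log(pod.metadata.name, ns))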