Bug | Resolution: Done | Critical | 4.20.0 | Quality / Stability / Reliability | Important | Proposed
For roughly three days, 4.20 nightlies have been permafailing on this job:
[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
[ { "metric": { "__name__": "ALERTS", "alertname": "KubeJobFailed", "alertstate": "firing", "condition": "true", "container": "kube-rbac-proxy-main", "endpoint": "https-main", "job": "kube-state-metrics", "job_name": "sre-replace-packageserver-csv", "namespace": "openshift-operator-lifecycle-manager", "prometheus": "openshift-monitoring/k8s", "service": "kube-state-metrics", "severity": "warning" }, "value": [ 1748242812.353, "1" ] } ]
Possibly this is related to a late change the OLM teams are trying to get into 4.19 to update their catalog versions?
Example job failure: https://prow.ci.openshift.org/view/gs/test-platform-results/logs/periodic-ci-openshift-release-master-nightly-4.20-e2e-rosa-sts-ovn/1926847779737440256
Once this is addressed, please check whether there's a brittle SRE job here that could be improved to prevent recurrences in the future.