-
Story
-
Resolution: Done
-
Major
-
We have been seeing regular failures recently, mostly due to openshift-e2e-loki ErrImagePull issues caused by Akamai caching error pages. When we know the cause isn't a product issue, we don't want to fail the payload because of it.
The kubelet logs contain the failures we have been seeing recently: an error page is returned instead of the image signature, which causes a corrupt-signature error. The log line contains the locator, including the namespace. We can count these occurrences and, once they pass a specific threshold, filter out the alerting errors we see later on for those namespaces.
Feb 01 05:37:45.731611 ci-op-vyccmv3h-4ef92-xs5k5-master-0 kubenswrapper[2213]: E0201 05:37:45.730879 2213 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"oauth-proxy\" with ErrImagePull: \"rpc error: code = Unknown desc = copying system image from manifest list: reading signatures: parsing signature https://registry.redhat.io/containers/sigstore/openshift4/ose-oauth-proxy@sha256=f968922564c3eea1c69d6bbe529d8970784d6cae8935afaf674d9fa7c0f72ea3/signature-9: unrecognized signature format, starting with binary 0x3c\"" pod="openshift-e2e-loki/loki-promtail-plm74" podUID=59b26cbf-3421-407c-98ee-986b5a091ef4
We can extract the namespace from the pod locator:
pod="openshift-e2e-loki/loki-promtail-plm74"
Then, when evaluating the alerts to check for failures, filter them out if we know that at least X of these errors were seen in the kubelet logs.
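A sketch of the counting and threshold check, reusing the same locator pattern as above; the threshold value and function names are hypothetical, and the real implementation would hook into how origin gathers kubelet logs and evaluates alert intervals:

package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// signatureErrorThreshold stands in for X; the real value would be tuned
// against the failing jobs.
const signatureErrorThreshold = 3

var podLocatorRE = regexp.MustCompile(`pod="([^/"]+)/[^"]+"`)

// countSignatureErrors tallies ErrImagePull "unrecognized signature format"
// occurrences per namespace across the kubelet log.
func countSignatureErrors(kubeletLog string) map[string]int {
	counts := map[string]int{}
	scanner := bufio.NewScanner(strings.NewReader(kubeletLog))
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.Contains(line, "ErrImagePull") ||
			!strings.Contains(line, "unrecognized signature format") {
			continue
		}
		if m := podLocatorRE.FindStringSubmatch(line); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

// tolerateAlertFailure reports whether an alert firing in the given namespace
// should be filtered out because we already saw enough known-bad image pulls there.
func tolerateAlertFailure(namespace string, counts map[string]int) bool {
	return counts[namespace] >= signatureErrorThreshold
}

func main() {
	kubeletLog := `E0201 05:37:45.730879 2213 pod_workers.go:965] "Error syncing pod, skipping" err="... ErrImagePull: ... unrecognized signature format ..." pod="openshift-e2e-loki/loki-promtail-plm74"`
	counts := countSignatureErrors(kubeletLog)
	// Only one occurrence here, so the alert would not yet be filtered.
	fmt.Println(tolerateAlertFailure("openshift-e2e-loki", counts)) // prints: false
}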
Example failing test:

[bz-Unknown][invariant] alert/KubePodNotReady should not be at or above info in all the other namespaces

{ KubePodNotReady was at or above info for at least 2h47m30s on platformidentification.JobType{Release:"4.13", FromRelease:"4.12", Platform:"gcp", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 3m52s, firing for 2h47m30s: Feb 01 06:10:48.338 - 2398s W alert/KubePodNotReady ns/openshift-e2e-loki pod/loki-promtail-ld26r ALERTS{alertname="KubePodNotReady", alertstate="firing", namespace="openshift-e2e-loki", pod="loki-promtail-ld26r", prometheus="openshift-monitoring/k8s", severity="warning"}
Another failing test:

[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

Run #0: Failed (1m2s)

{ fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:522]: Unexpected error: <errors.aggregate | len:1, cap:1>: [ <*errors.errorString | 0xc0014a0900>{ s: "promQL query returned unexpected results:\nALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards|KubeJobFailed|Watchdog|KubePodNotReady|...\",alertstate=\"firing\",severity!=\"info\"} >= 1\n[\n {\n \"metric\": {\n \"__name__\": \"ALERTS\",\n \"alertname\": \"KubeContainerWaiting\",\n \"alertstate\": \"firing\",\n \"container\": \"oauth-proxy\",\n \"namespace\": \"openshift-e2e-loki\",\n \"pod\": \"loki-promtail-tfrnc\",\n \"prometheus\": \"openshift-monitoring/k8s\",\n \"severity\": \"warning\"\n },\n \"value\": [\n 1675236853.465,\n \"1\"\n ]\n },\n {\n \"metric\": {\n \"__name__\": \"ALERTS\",\n \"alertname\": \"KubeDaemonSetRolloutStuck\",\n \"alertstate\": \"firing\",\n \"container\": \"kube-rbac-proxy-main\",\n \"daemonset\": \"loki-promtail\",\n \"endpoint\": \"https-main\",\n \"job\": \"kube-state-metrics\",\n \"namespace\": \"openshift-e2e-loki\",\n \"prometheus\": \"openshift-monitoring/k8s\",\n \"service\": \"kube-state-metri...