-
Story
-
Resolution: Done
-
Major
-
We have been seeing regular failures recently, mostly due to openshift-e2e-loki ErrImagePull issues caused by Akamai caching error pages. When we know the cause isn't a product issue, we don't want to fail the payload because of it.
The kubelet logs contain the failures we have been seeing recently: an error page is returned instead of the image signature, which causes a corrupt-signature error. The log line contains the locator, including the namespace. We can count these occurrences and, once they pass a specific threshold, filter out the alerting errors we see later on for those namespaces.
Feb 01 05:37:45.731611 ci-op-vyccmv3h-4ef92-xs5k5-master-0 kubenswrapper[2213]: E0201 05:37:45.730879 2213 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"oauth-proxy\" with ErrImagePull: \"rpc error: code = Unknown desc = copying system image from manifest list: reading signatures: parsing signature https://registry.redhat.io/containers/sigstore/openshift4/ose-oauth-proxy@sha256=f968922564c3eea1c69d6bbe529d8970784d6cae8935afaf674d9fa7c0f72ea3/signature-9: unrecognized signature format, starting with binary 0x3c\"" pod="openshift-e2e-loki/loki-promtail-plm74" podUID=59b26cbf-3421-407c-98ee-986b5a091ef4
We can extract the namespace from the pod locator:
pod="openshift-e2e-loki/loki-promtail-plm74"
Then, when evaluating the alerts to check for failures, filter them out if we know that at least X of these errors were seen in the kubelet logs.
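A sketch of the counting and threshold check, reusing the same locator pattern as above; the threshold value and function names are hypothetical, and the real implementation would hook into how origin gathers kubelet logs and evaluates alert intervals:

package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// signatureErrorThreshold stands in for X; the real value would be tuned
// against the failing jobs.
const signatureErrorThreshold = 3

var podLocatorRE = regexp.MustCompile(`pod="([^/"]+)/[^"]+"`)

// countSignatureErrors tallies ErrImagePull "unrecognized signature format"
// occurrences per namespace across the kubelet log.
func countSignatureErrors(kubeletLog string) map[string]int {
	counts := map[string]int{}
	scanner := bufio.NewScanner(strings.NewReader(kubeletLog))
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.Contains(line, "ErrImagePull") ||
			!strings.Contains(line, "unrecognized signature format") {
			continue
		}
		if m := podLocatorRE.FindStringSubmatch(line); m != nil {
			counts[m[1]]++
		}
	}
	return counts
}

// tolerateAlertFailure reports whether an alert firing in the given namespace
// should be filtered out because we already saw enough known-bad image pulls there.
func tolerateAlertFailure(namespace string, counts map[string]int) bool {
	return counts[namespace] >= signatureErrorThreshold
}

func main() {
	kubeletLog := `E0201 05:37:45.730879 2213 pod_workers.go:965] "Error syncing pod, skipping" err="... ErrImagePull: ... unrecognized signature format ..." pod="openshift-e2e-loki/loki-promtail-plm74"`
	counts := countSignatureErrors(kubeletLog)
	// Only one occurrence here, so the alert would not yet be filtered.
	fmt.Println(tolerateAlertFailure("openshift-e2e-loki", counts)) // prints: false
}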
Example failing test:

[bz-Unknown][invariant] alert/KubePodNotReady should not be at or above info in all the other namespaces

{ KubePodNotReady was at or above info for at least 2h47m30s on platformidentification.JobType{Release:"4.13", FromRelease:"4.12", Platform:"gcp", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 3m52s, firing for 2h47m30s: Feb 01 06:10:48.338 - 2398s W alert/KubePodNotReady ns/openshift-e2e-loki pod/loki-promtail-ld26r ALERTS{alertname="KubePodNotReady", alertstate="firing", namespace="openshift-e2e-loki", pod="loki-promtail-ld26r", prometheus="openshift-monitoring/k8s", severity="warning"}
Another failing test:

[sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]

Run #0: Failed (1m2s)

{ fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:522]: Unexpected error: <errors.aggregate | len:1, cap:1>: [ <*errors.errorString | 0xc0014a0900>{ s: "promQL query returned unexpected results:\nALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards|KubeJobFailed|Watchdog|KubePodNotReady|...\",alertstate=\"firing\",severity!=\"info\"} >= 1\n[\n {\n \"metric\": {\n \"__name__\": \"ALERTS\",\n \"alertname\": \"KubeContainerWaiting\",\n \"alertstate\": \"firing\",\n \"container\": \"oauth-proxy\",\n \"namespace\": \"openshift-e2e-loki\",\n \"pod\": \"loki-promtail-tfrnc\",\n \"prometheus\": \"openshift-monitoring/k8s\",\n \"severity\": \"warning\"\n },\n \"value\": [\n 1675236853.465,\n \"1\"\n ]\n },\n {\n \"metric\": {\n \"__name__\": \"ALERTS\",\n \"alertname\": \"KubeDaemonSetRolloutStuck\",\n \"alertstate\": \"firing\",\n \"container\": \"kube-rbac-proxy-main\",\n \"daemonset\": \"loki-promtail\",\n \"endpoint\": \"https-main\",\n \"job\": \"kube-state-metrics\",\n \"namespace\": \"openshift-e2e-loki\",\n \"prometheus\": \"openshift-monitoring/k8s\",\n \"service\": \"kube-state-metri...