OCP Technical Release Team / TRT-819

Suppress KubePodNotReady Failures due to ErrImagePull


    • Type: Story
    • Resolution: Done
    • Priority: Major

       

      We have recently seen regular failures, mostly due to openshift-e2e-loki ErrImagePull issues caused by Akamai caching error pages. If we know the cause is not a product issue, we don't want to fail the payload because of it.

       

      The kubelet logs contain the failures we have seen recently: an error page is returned in place of the image signature, producing a corrupt/unrecognized signature error. The log line contains the locator, including the namespace. We can count these occurrences and, when they exceed a specific threshold, filter out the alerting failures we see later for those namespaces.

      7758: Feb 01 05:37:45.731611 ci-op-vyccmv3h-4ef92-xs5k5-master-0 kubenswrapper[2213]: E0201 05:37:45.730879 2213 pod_workers.go:965] "Error syncing pod, skipping" err="failed to \"StartContainer\" for \"oauth-proxy\" with ErrImagePull: \"rpc error: code = Unknown desc = copying system image from manifest list: reading signatures: parsing signature https://registry.redhat.io/containers/sigstore/openshift4/ose-oauth-proxy@sha256=f968922564c3eea1c69d6bbe529d8970784d6cae8935afaf674d9fa7c0f72ea3/signature-9: unrecognized signature format, starting with binary 0x3c\"" pod="openshift-e2e-loki/loki-promtail-plm74" podUID=59b26cbf-3421-407c-98ee-986b5a091ef4

       

      We can extract the namespace from

      pod="openshift-e2e-loki/loki-promtail-plm74"

      Then, when evaluating the alerts to check for failures, filter them out if we know that X or more of these errors were seen in the kubelet logs for that namespace.
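
      A minimal sketch of the filtering side, under the same assumptions (the AlertInterval type and the threshold value are made up for illustration): given the per-namespace counts from the kubelet logs, drop the alert intervals for namespaces that crossed the threshold.

      // A rough sketch (not the actual origin code) of suppressing alert failures
      // for namespaces that already exceeded the kubelet ErrImagePull threshold.
      package alertfilter

      // AlertInterval is a stand-in for whatever alert record the test evaluates;
      // only the namespace matters for this filter.
      type AlertInterval struct {
          AlertName string
          Namespace string
          Message   string
      }

      // errImagePullThreshold is the assumed "X number of errors" from the
      // description; the real value would be tuned against CI data.
      const errImagePullThreshold = 5

      // FilterKnownErrImagePullNamespaces drops alert intervals for namespaces that
      // saw at least errImagePullThreshold signature-related ErrImagePull failures
      // in the kubelet logs, since those are known not to be product issues.
      func FilterKnownErrImagePullNamespaces(alerts []AlertInterval, errCounts map[string]int) []AlertInterval {
          kept := make([]AlertInterval, 0, len(alerts))
          for _, a := range alerts {
              if errCounts[a.Namespace] >= errImagePullThreshold {
                  continue // e.g. openshift-e2e-loki hit by the Akamai cached error page
              }
              kept = append(kept, a)
          }
          return kept
      }

      Keying the filter on namespace rather than pod name matches how the alert test groups its results, so a single threshold covers all the loki-promtail pods in openshift-e2e-loki.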

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade/1620652883970101248

      : [bz-Unknown][invariant] alert/KubePodNotReady should not be at or above info in all the other namespaces
                    0s 
                    
                      {  KubePodNotReady was at or above info for at least 2h47m30s on platformidentification.JobType{Release:"4.13", FromRelease:"4.12", Platform:"gcp", Architecture:"amd64", Network:"ovn", Topology:"ha"} (maxAllowed=0s): pending for 3m52s, firing for 2h47m30s:
      
      Feb 01 06:10:48.338 - 2398s W alert/KubePodNotReady ns/openshift-e2e-loki pod/loki-promtail-ld26r ALERTS{alertname="KubePodNotReady", alertstate="firing", namespace="openshift-e2e-loki", pod="loki-promtail-ld26r", prometheus="openshift-monitoring/k8s", severity="warning"}
      

       

       [sig-instrumentation] Prometheus [apigroup:image.openshift.io] when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early][apigroup:config.openshift.io] [Skipped:Disconnected] [Suite:openshift/conformance/parallel]
                                Run #0: Failed
                                1m2s
                                
                                  {  fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:522]: Unexpected error:
          <errors.aggregate | len:1, cap:1>: [
              <*errors.errorString | 0xc0014a0900>{
                  s: "promQL query returned unexpected results:\nALERTS{alertname!~\"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards|KubeJobFailed|Watchdog|KubePodNotReady|...\",alertstate=\"firing\",severity!=\"info\"} >= 1\n[\n  {\n    \"metric\": {\n      \"__name__\": \"ALERTS\",\n      \"alertname\": \"KubeContainerWaiting\",\n      \"alertstate\": \"firing\",\n      \"container\": \"oauth-proxy\",\n      \"namespace\": \"openshift-e2e-loki\",\n      \"pod\": \"loki-promtail-tfrnc\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"severity\": \"warning\"\n    },\n    \"value\": [\n      1675236853.465,\n      \"1\"\n    ]\n  },\n  {\n    \"metric\": {\n      \"__name__\": \"ALERTS\",\n      \"alertname\": \"KubeDaemonSetRolloutStuck\",\n      \"alertstate\": \"firing\",\n      \"container\": \"kube-rbac-proxy-main\",\n      \"daemonset\": \"loki-promtail\",\n      \"endpoint\": \"https-main\",\n      \"job\": \"kube-state-metrics\",\n      \"namespace\": \"openshift-e2e-loki\",\n      \"prometheus\": \"openshift-monitoring/k8s\",\n      \"service\": \"kube-state-metri...
       

       

       

       

              Assignee: Forrest Babcock
              Reporter: Forrest Babcock