Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-14033

Pathological test failing on reason/RecreatingFailedPod in openshift-monitoring

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done-Errata
    • Icon: Critical Critical
    • 4.14.0
    • 4.14
    • Monitoring
    • -
    • Important
    • No
    • MON Sprint 237
    • 1
    • Approved
    • False
    • Hide

      None

      Show
      None
    • N/A
    • Release Note Not Required

      Description of problem:

      We have presubmit and periodic jobs failing on
      
      : [sig-arch] events should not repeat pathologically for namespace openshift-monitoring
      {  2 events happened too frequently
      
      event happened 21 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/6f9bc9e1d7 - pathological/true reason/RecreatingFailedPod StatefulSet openshift-monitoring/prometheus-k8s is recreating failed Pod prometheus-k8s-1 From: 16:11:36Z To: 16:11:37Z result=reject 
      event happened 22 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/ecfdd1d225 - pathological/true reason/SuccessfulDelete delete Pod prometheus-k8s-1 in StatefulSet prometheus-k8s successful From: 16:11:36Z To: 16:11:37Z result=reject }
      
      The failure occurs when the event happens over 20 times.
      
      The RecreatingFailedPod reason shows up in 4.14 and Presubmits and does not show up in 4.13.
      
      

      Version-Release number of selected component (if applicable):

      4.14
      

      How reproducible:

      Run presubmits or periodics; here are latest examples:
      
       2023-05-24 06:25:52.551883+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1661210557367193600                                                                                        | {aws,amd64,sdn,ha,serial}
       2023-05-24 10:20:54.91883+00  | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-sdn-serial/1661267817128792064                                                                                   | {gcp,amd64,sdn,ha,serial}
       2023-05-24 14:17:18.849402+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27899/pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade/1661321663389634560                                                                                      | {gcp,amd64,ovn,upgrade,upgrade-micro,ha}
       2023-05-24 14:17:51.908405+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1583/pull-ci-openshift-kubernetes-master-e2e-azure-ovn-upgrade/1661324100011823104                                                            | {azure,amd64,ovn,upgrade,upgrade-micro,ha}
      
      

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

       

      Expected results:

      That event/reason should not show up as a failure in the pathological test

      Additional info:

      This table shows what variants on 4.14 and Presubmits:

                           variants                     | test_count 
      --------------------------------------------------+------------
       {aws,amd64,ovn,upgrade,upgrade-micro,ha}         |         63
       {gcp,amd64,ovn,upgrade,upgrade-micro,ha}         |         14
       {gcp,amd64,sdn,ha,serial,techpreview}            |         12
       {azure,amd64,sdn,ha,serial,techpreview}          |          7
       {aws,amd64,sdn,upgrade,upgrade-micro,ha}         |          6
       {aws,amd64,ovn,ha}                               |          6
       {vsphere-ipi,amd64,ovn,upgrade,upgrade-micro,ha} |          5
       {aws,amd64,sdn,ha,serial}                        |          5
       {azure,amd64,ovn,upgrade,upgrade-micro,ha}       |          5
       {metal-ipi,amd64,ovn,upgrade,upgrade-micro,ha}   |          5
       {vsphere-ipi,amd64,ovn,ha,serial}                |          4
       {gcp,amd64,sdn,ha,serial}                        |          3
       {aws,amd64,ovn,single-node}                      |          3
       {metal-ipi,amd64,ovn,ha,serial}                  |          2
       {aws,amd64,ovn,ha,serial}                        |          2
       {aws,amd64,upgrade,upgrade-micro,ha}             |          1
       {aws,arm64,sdn,ha,serial}                        |          1
       {aws,arm64,ovn,ha,serial,techpreview}            |          1
       {vsphere-ipi,amd64,ovn,ha,serial,techpreview}    |          1
       {aws,amd64,sdn,ha,serial,techpreview}            |          1
       {libvirt,ppc64le,ovn,ha,serial}                  |          1
       {amd64,upgrade,upgrade-micro,ha}                 |          1
      

      Just for my record, I'm using this query to check 4.14 and Presubmits:

      SELECT
          rt.created_at, url, variants
      FROM
          prow_jobs pj
          JOIN prow_job_runs r ON r.prow_job_id = pj.id
          JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id
          JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id
          JOIN tests ON rt.test_id = tests.id
      WHERE
          pj.release IN ('4.14', 'Presubmits')
          AND rt.status = 12
          AND tests.id = 65991
          AND o.output LIKE '%RecreatingFailedPod%'
      ORDER BY rt.created_at, variants DESC;
      

      And this query for checking 4.13:

      SELECT
          rt.created_at, url, variants
      FROM
          prow_jobs pj
          JOIN prow_job_runs r ON r.prow_job_id = pj.id
          JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id
          JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id
          JOIN tests ON rt.test_id = tests.id
      WHERE
          pj.release IN ('4.13')
          AND rt.status = 12
          AND tests.id IN (65991, 244,245)
          AND o.output LIKE '%RecreatingFailedPod%'
      ORDER BY rt.created_at, variants DESC;
      

      This shows jobs beginning on 4/13 to today.

              spasquie@redhat.com Simon Pasquier
              dperique@redhat.com Dennis Periquet
              Junqi Zhao Junqi Zhao
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

                Created:
                Updated:
                Resolved: