Loading...

XML

Word

Printable

Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: 4.14.0
Affects Version/s: 4.14
Component/s: Monitoring
Labels:
- trt-standup

Test Coverage:

-
Severity:
Important
Regression:
No
Sprint:
MON Sprint 237
sprint_count:
1
Release Blocker:
Approved
Blocked:
False
Blocked Reason:

Hide

None

Show
None
Release Note Text:
N/A
Release Note Type:
Release Note Not Required
Target Version:

4.14.0

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Description of problem:

We have presubmit and periodic jobs failing on

: [sig-arch] events should not repeat pathologically for namespace openshift-monitoring
{  2 events happened too frequently

event happened 21 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/6f9bc9e1d7 - pathological/true reason/RecreatingFailedPod StatefulSet openshift-monitoring/prometheus-k8s is recreating failed Pod prometheus-k8s-1 From: 16:11:36Z To: 16:11:37Z result=reject 
event happened 22 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/ecfdd1d225 - pathological/true reason/SuccessfulDelete delete Pod prometheus-k8s-1 in StatefulSet prometheus-k8s successful From: 16:11:36Z To: 16:11:37Z result=reject }

The failure occurs when the event happens over 20 times.

The RecreatingFailedPod reason shows up in 4.14 and Presubmits and does not show up in 4.13.

Version-Release number of selected component (if applicable):

4.14

How reproducible:

Run presubmits or periodics; here are latest examples:

 2023-05-24 06:25:52.551883+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1661210557367193600                                                                                        | {aws,amd64,sdn,ha,serial}
 2023-05-24 10:20:54.91883+00  | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-sdn-serial/1661267817128792064                                                                                   | {gcp,amd64,sdn,ha,serial}
 2023-05-24 14:17:18.849402+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27899/pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade/1661321663389634560                                                                                      | {gcp,amd64,ovn,upgrade,upgrade-micro,ha}
 2023-05-24 14:17:51.908405+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1583/pull-ci-openshift-kubernetes-master-e2e-azure-ovn-upgrade/1661324100011823104                                                            | {azure,amd64,ovn,upgrade,upgrade-micro,ha}

Steps to Reproduce:

1.
2.
3.

Actual results:

Expected results:

That event/reason should not show up as a failure in the pathological test

Additional info:

This table shows what variants on 4.14 and Presubmits:

                     variants                     | test_count 
--------------------------------------------------+------------
 {aws,amd64,ovn,upgrade,upgrade-micro,ha}         |         63
 {gcp,amd64,ovn,upgrade,upgrade-micro,ha}         |         14
 {gcp,amd64,sdn,ha,serial,techpreview}            |         12
 {azure,amd64,sdn,ha,serial,techpreview}          |          7
 {aws,amd64,sdn,upgrade,upgrade-micro,ha}         |          6
 {aws,amd64,ovn,ha}                               |          6
 {vsphere-ipi,amd64,ovn,upgrade,upgrade-micro,ha} |          5
 {aws,amd64,sdn,ha,serial}                        |          5
 {azure,amd64,ovn,upgrade,upgrade-micro,ha}       |          5
 {metal-ipi,amd64,ovn,upgrade,upgrade-micro,ha}   |          5
 {vsphere-ipi,amd64,ovn,ha,serial}                |          4
 {gcp,amd64,sdn,ha,serial}                        |          3
 {aws,amd64,ovn,single-node}                      |          3
 {metal-ipi,amd64,ovn,ha,serial}                  |          2
 {aws,amd64,ovn,ha,serial}                        |          2
 {aws,amd64,upgrade,upgrade-micro,ha}             |          1
 {aws,arm64,sdn,ha,serial}                        |          1
 {aws,arm64,ovn,ha,serial,techpreview}            |          1
 {vsphere-ipi,amd64,ovn,ha,serial,techpreview}    |          1
 {aws,amd64,sdn,ha,serial,techpreview}            |          1
 {libvirt,ppc64le,ovn,ha,serial}                  |          1
 {amd64,upgrade,upgrade-micro,ha}                 |          1

Just for my record, I'm using this query to check 4.14 and Presubmits:

SELECT
    rt.created_at, url, variants
FROM
    prow_jobs pj
    JOIN prow_job_runs r ON r.prow_job_id = pj.id
    JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id
    JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id
    JOIN tests ON rt.test_id = tests.id
WHERE
    pj.release IN ('4.14', 'Presubmits')
    AND rt.status = 12
    AND tests.id = 65991
    AND o.output LIKE '%RecreatingFailedPod%'
ORDER BY rt.created_at, variants DESC;

And this query for checking 4.13:

SELECT
    rt.created_at, url, variants
FROM
    prow_jobs pj
    JOIN prow_job_runs r ON r.prow_job_id = pj.id
    JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id
    JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id
    JOIN tests ON rt.test_id = tests.id
WHERE
    pj.release IN ('4.13')
    AND rt.status = 12
    AND tests.id IN (65991, 244,245)
    AND o.output LIKE '%RecreatingFailedPod%'
ORDER BY rt.created_at, variants DESC;

This shows jobs beginning on 4/13 to today.