-
Bug
-
Resolution: Done-Errata
-
Critical
-
4.14
-
-
-
Important
-
No
-
MON Sprint 237
-
1
-
Approved
-
False
-
-
N/A
-
Release Note Not Required
Description of problem:
We have presubmit and periodic jobs failing on : [sig-arch] events should not repeat pathologically for namespace openshift-monitoring { 2 events happened too frequently event happened 21 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/6f9bc9e1d7 - pathological/true reason/RecreatingFailedPod StatefulSet openshift-monitoring/prometheus-k8s is recreating failed Pod prometheus-k8s-1 From: 16:11:36Z To: 16:11:37Z result=reject event happened 22 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/ecfdd1d225 - pathological/true reason/SuccessfulDelete delete Pod prometheus-k8s-1 in StatefulSet prometheus-k8s successful From: 16:11:36Z To: 16:11:37Z result=reject } The failure occurs when the event happens over 20 times. The RecreatingFailedPod reason shows up in 4.14 and Presubmits and does not show up in 4.13.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Run presubmits or periodics; here are latest examples: 2023-05-24 06:25:52.551883+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1661210557367193600 | {aws,amd64,sdn,ha,serial} 2023-05-24 10:20:54.91883+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-sdn-serial/1661267817128792064 | {gcp,amd64,sdn,ha,serial} 2023-05-24 14:17:18.849402+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27899/pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade/1661321663389634560 | {gcp,amd64,ovn,upgrade,upgrade-micro,ha} 2023-05-24 14:17:51.908405+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1583/pull-ci-openshift-kubernetes-master-e2e-azure-ovn-upgrade/1661324100011823104 | {azure,amd64,ovn,upgrade,upgrade-micro,ha}
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
That event/reason should not show up as a failure in the pathological test
Additional info:
This table shows what variants on 4.14 and Presubmits:
variants | test_count --------------------------------------------------+------------ {aws,amd64,ovn,upgrade,upgrade-micro,ha} | 63 {gcp,amd64,ovn,upgrade,upgrade-micro,ha} | 14 {gcp,amd64,sdn,ha,serial,techpreview} | 12 {azure,amd64,sdn,ha,serial,techpreview} | 7 {aws,amd64,sdn,upgrade,upgrade-micro,ha} | 6 {aws,amd64,ovn,ha} | 6 {vsphere-ipi,amd64,ovn,upgrade,upgrade-micro,ha} | 5 {aws,amd64,sdn,ha,serial} | 5 {azure,amd64,ovn,upgrade,upgrade-micro,ha} | 5 {metal-ipi,amd64,ovn,upgrade,upgrade-micro,ha} | 5 {vsphere-ipi,amd64,ovn,ha,serial} | 4 {gcp,amd64,sdn,ha,serial} | 3 {aws,amd64,ovn,single-node} | 3 {metal-ipi,amd64,ovn,ha,serial} | 2 {aws,amd64,ovn,ha,serial} | 2 {aws,amd64,upgrade,upgrade-micro,ha} | 1 {aws,arm64,sdn,ha,serial} | 1 {aws,arm64,ovn,ha,serial,techpreview} | 1 {vsphere-ipi,amd64,ovn,ha,serial,techpreview} | 1 {aws,amd64,sdn,ha,serial,techpreview} | 1 {libvirt,ppc64le,ovn,ha,serial} | 1 {amd64,upgrade,upgrade-micro,ha} | 1
Just for my record, I'm using this query to check 4.14 and Presubmits:
SELECT rt.created_at, url, variants FROM prow_jobs pj JOIN prow_job_runs r ON r.prow_job_id = pj.id JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id JOIN tests ON rt.test_id = tests.id WHERE pj.release IN ('4.14', 'Presubmits') AND rt.status = 12 AND tests.id = 65991 AND o.output LIKE '%RecreatingFailedPod%' ORDER BY rt.created_at, variants DESC;
And this query for checking 4.13:
SELECT rt.created_at, url, variants FROM prow_jobs pj JOIN prow_job_runs r ON r.prow_job_id = pj.id JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id JOIN tests ON rt.test_id = tests.id WHERE pj.release IN ('4.13') AND rt.status = 12 AND tests.id IN (65991, 244,245) AND o.output LIKE '%RecreatingFailedPod%' ORDER BY rt.created_at, variants DESC;
This shows jobs beginning on 4/13 to today.
- relates to
-
OCPBUGS-1626 alertmanager pod restarted once to become ready
- Closed
- links to
-
RHEA-2023:5006 rpm