-
Bug
-
Resolution: Done-Errata
-
Critical
-
4.14
-
Quality / Stability / Reliability
-
False
-
-
None
-
Important
-
No
-
None
-
Approved
-
MON Sprint 237
-
1
-
-
-
None
-
Release Note Not Required
-
N/A
-
None
-
None
-
None
-
None
Description of problem:
We have presubmit and periodic jobs failing on
: [sig-arch] events should not repeat pathologically for namespace openshift-monitoring
{ 2 events happened too frequently
event happened 21 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/6f9bc9e1d7 - pathological/true reason/RecreatingFailedPod StatefulSet openshift-monitoring/prometheus-k8s is recreating failed Pod prometheus-k8s-1 From: 16:11:36Z To: 16:11:37Z result=reject
event happened 22 times, something is wrong: ns/openshift-monitoring statefulset/prometheus-k8s hmsg/ecfdd1d225 - pathological/true reason/SuccessfulDelete delete Pod prometheus-k8s-1 in StatefulSet prometheus-k8s successful From: 16:11:36Z To: 16:11:37Z result=reject }
The failure occurs when the event happens over 20 times.
The RecreatingFailedPod reason shows up in 4.14 and Presubmits and does not show up in 4.13.
Version-Release number of selected component (if applicable):
4.14
How reproducible:
Run presubmits or periodics; here are latest examples:
2023-05-24 06:25:52.551883+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-aws-sdn-serial/1661210557367193600 | {aws,amd64,sdn,ha,serial}
2023-05-24 10:20:54.91883+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.14-e2e-gcp-sdn-serial/1661267817128792064 | {gcp,amd64,sdn,ha,serial}
2023-05-24 14:17:18.849402+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/27899/pull-ci-openshift-origin-master-e2e-gcp-ovn-upgrade/1661321663389634560 | {gcp,amd64,ovn,upgrade,upgrade-micro,ha}
2023-05-24 14:17:51.908405+00 | https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/1583/pull-ci-openshift-kubernetes-master-e2e-azure-ovn-upgrade/1661324100011823104 | {azure,amd64,ovn,upgrade,upgrade-micro,ha}
Steps to Reproduce:
1. 2. 3.
Actual results:
Expected results:
That event/reason should not show up as a failure in the pathological test
Additional info:
This table shows what variants on 4.14 and Presubmits:
variants | test_count
--------------------------------------------------+------------
{aws,amd64,ovn,upgrade,upgrade-micro,ha} | 63
{gcp,amd64,ovn,upgrade,upgrade-micro,ha} | 14
{gcp,amd64,sdn,ha,serial,techpreview} | 12
{azure,amd64,sdn,ha,serial,techpreview} | 7
{aws,amd64,sdn,upgrade,upgrade-micro,ha} | 6
{aws,amd64,ovn,ha} | 6
{vsphere-ipi,amd64,ovn,upgrade,upgrade-micro,ha} | 5
{aws,amd64,sdn,ha,serial} | 5
{azure,amd64,ovn,upgrade,upgrade-micro,ha} | 5
{metal-ipi,amd64,ovn,upgrade,upgrade-micro,ha} | 5
{vsphere-ipi,amd64,ovn,ha,serial} | 4
{gcp,amd64,sdn,ha,serial} | 3
{aws,amd64,ovn,single-node} | 3
{metal-ipi,amd64,ovn,ha,serial} | 2
{aws,amd64,ovn,ha,serial} | 2
{aws,amd64,upgrade,upgrade-micro,ha} | 1
{aws,arm64,sdn,ha,serial} | 1
{aws,arm64,ovn,ha,serial,techpreview} | 1
{vsphere-ipi,amd64,ovn,ha,serial,techpreview} | 1
{aws,amd64,sdn,ha,serial,techpreview} | 1
{libvirt,ppc64le,ovn,ha,serial} | 1
{amd64,upgrade,upgrade-micro,ha} | 1
Just for my record, I'm using this query to check 4.14 and Presubmits:
SELECT
rt.created_at, url, variants
FROM
prow_jobs pj
JOIN prow_job_runs r ON r.prow_job_id = pj.id
JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id
JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id
JOIN tests ON rt.test_id = tests.id
WHERE
pj.release IN ('4.14', 'Presubmits')
AND rt.status = 12
AND tests.id = 65991
AND o.output LIKE '%RecreatingFailedPod%'
ORDER BY rt.created_at, variants DESC;
And this query for checking 4.13:
SELECT
rt.created_at, url, variants
FROM
prow_jobs pj
JOIN prow_job_runs r ON r.prow_job_id = pj.id
JOIN prow_job_run_tests rt ON rt.prow_job_run_id = r.id
JOIN prow_job_run_test_outputs o ON o.prow_job_run_test_id = rt.id
JOIN tests ON rt.test_id = tests.id
WHERE
pj.release IN ('4.13')
AND rt.status = 12
AND tests.id IN (65991, 244,245)
AND o.output LIKE '%RecreatingFailedPod%'
ORDER BY rt.created_at, variants DESC;
This shows jobs beginning on 4/13 to today.
- relates to
-
OCPBUGS-1626 alertmanager pod restarted once to become ready
-
- Closed
-
- links to
-
RHEA-2023:5006
rpm