Bug
Resolution: Done-Errata
Normal
4.12.z
None
Moderate
No
Rejected
False
Description of problem
CI is flaky because of test failures such as the following:
[sig-arch] events should not repeat pathologically
{
  2 events happened too frequently
  event happened 21 times, something is wrong: node/ip-10-0-162-91.us-west-2.compute.internal hmsg/e277cb97cf - pathological/true reason/ErrorReconcilingNode roles/worker [k8s.ovn.org/node-chassis-id annotation not found for node ip-10-0-162-91.us-west-2.compute.internal, macAddress annotation not found for node "ip-10-0-162-91.us-west-2.compute.internal" , k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-162-91.us-west-2.compute.internal"] From: 17:47:14Z To: 17:47:15Z result=reject
  event happened 22 times, something is wrong: node/ip-10-0-162-91.us-west-2.compute.internal hmsg/e277cb97cf - pathological/true reason/ErrorReconcilingNode roles/worker [k8s.ovn.org/node-chassis-id annotation not found for node ip-10-0-162-91.us-west-2.compute.internal, macAddress annotation not found for node "ip-10-0-162-91.us-west-2.compute.internal" , k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-162-91.us-west-2.compute.internal"] From: 17:47:15Z To: 17:47:16Z result=reject
}
This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/901/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-serial/1638557665338593280. Search.ci has many more similar failures.
Version-Release number of selected component (if applicable):
I have seen this in 4.12, 4.13, and 4.14 CI jobs.
How reproducible:
Presently, search.ci shows the following stats for the past two days:
Found in 0.25% of runs (1.49% of failures) across 44431 total runs and 4957 jobs (16.76% failed) in 321ms
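For a rough sense of scale, the percentages above can be turned into approximate run counts. This is only a sketch: it assumes that "16.76% failed" refers to the share of the 44431 runs that failed, and the inputs are rounded percentages, so the results are estimates rather than exact counts from search.ci.

```python
# Figures copied from the search.ci summary line above.
TOTAL_RUNS = 44431
FAILED_FRACTION = 0.1676             # "16.76% failed" (assumed to be per run)
MATCH_FRACTION_OF_RUNS = 0.0025      # "Found in 0.25% of runs"
MATCH_FRACTION_OF_FAILURES = 0.0149  # "1.49% of failures"

# Estimated number of failed runs in the two-day window.
failed_runs = TOTAL_RUNS * FAILED_FRACTION

# Two independent estimates of how many runs hit this failure.
matches_from_runs = TOTAL_RUNS * MATCH_FRACTION_OF_RUNS
matches_from_failures = failed_runs * MATCH_FRACTION_OF_FAILURES

print(f"failed runs:                  ~{failed_runs:.0f}")
print(f"matching runs (from runs):     ~{matches_from_runs:.0f}")
print(f"matching runs (from failures): ~{matches_from_failures:.0f}")
```

Both estimates come out at roughly 111 matching runs, which suggests the two percentages describe the same set of runs.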
Steps to Reproduce
1. Post a PR and have bad luck.
2. Check search.ci: https://search.ci.openshift.org/?search=event+happened+%5Cd%2B+times%2C+something+is+wrong%3A+.*macAddress+annotation+not+found+for+node&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
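The search expression is URL-encoded in the link above, which makes it hard to read. The following sketch decodes it with the Python standard library and checks it against an excerpt of the failure message quoted in the description. The names `SEARCH_URL`, `SAMPLE`, and `search_pattern` are illustrative only, and Python's `re` module is assumed to treat this simple expression the same way search.ci does.

```python
import re
from urllib.parse import parse_qs, urlsplit

# The search.ci link from step 2, copied verbatim.
SEARCH_URL = (
    "https://search.ci.openshift.org/?search=event+happened+%5Cd%2B+times%2C"
    "+something+is+wrong%3A+.*macAddress+annotation+not+found+for+node"
    "&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5"
    "&maxBytes=20971520&groupBy=job"
)

# Excerpt of the failure message from the description (cut off after the
# macAddress clause; the rest of the message is not needed for the match).
SAMPLE = (
    "event happened 21 times, something is wrong: "
    "node/ip-10-0-162-91.us-west-2.compute.internal hmsg/e277cb97cf - "
    "pathological/true reason/ErrorReconcilingNode roles/worker "
    "[k8s.ovn.org/node-chassis-id annotation not found for node "
    "ip-10-0-162-91.us-west-2.compute.internal, macAddress annotation "
    'not found for node "ip-10-0-162-91.us-west-2.compute.internal"'
)


def search_pattern(url: str) -> str:
    """Return the decoded value of the `search` query parameter."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return query["search"][0]


pattern = search_pattern(SEARCH_URL)
print(pattern)

# The decoded expression should match the quoted failure message.
print(bool(re.search(pattern, SAMPLE)))

# Pull out the repeat count reported in the message.
count = int(re.search(r"event happened (\d+) times", SAMPLE).group(1))
print(count)
```

The other query parameters in the link (`maxAge=48h`, `type=junit`, `groupBy=job`) limit the search to JUnit results from the past 48 hours, grouped by job.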
Actual results
CI fails.
Expected results
CI passes, or fails on some other test failure.
Additional info:
In the search.ci results, the failures all appear to be in jobs with "serial" or "etcd-scaling" in the names. The failing jobs include AWS, Azure, and GCP, and no other platforms. I only checked the past 2 days because search.ci failed to load with a longer time horizon.
- clones: OCPBUGS-17910 [4.13] CI fails on "events should not repeat pathologically" because of missing node annotations (Closed)
- depends on: OCPBUGS-17910 [4.13] CI fails on "events should not repeat pathologically" because of missing node annotations (Closed)
- links to: RHBA-2023:5450 OpenShift Container Platform 4.12.z bug fix update