OpenShift Bugs / OCPBUGS-10841

CI fails on "events should not repeat pathologically" because of missing node annotations


Details

    • Bug
    • Resolution: Done-Errata
    • Major
    • None
    • 4.13, 4.12, 4.14
    • None
    • Moderate
    • No
    • SDN Sprint 235, SDN Sprint 236, SDN Sprint 237, SDN Sprint 238, SDN Sprint 239
    • 5
    • Rejected
    • False

    Description

      Description of problem

      CI is flaky because of test failures such as the following:

      [sig-arch] events should not repeat pathologically
      {  2 events happened too frequently
      
      event happened 21 times, something is wrong: node/ip-10-0-162-91.us-west-2.compute.internal hmsg/e277cb97cf - pathological/true reason/ErrorReconcilingNode roles/worker [k8s.ovn.org/node-chassis-id annotation not found for node ip-10-0-162-91.us-west-2.compute.internal, macAddress annotation not found for node "ip-10-0-162-91.us-west-2.compute.internal" , k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-162-91.us-west-2.compute.internal"] From: 17:47:14Z To: 17:47:15Z result=reject 
      event happened 22 times, something is wrong: node/ip-10-0-162-91.us-west-2.compute.internal hmsg/e277cb97cf - pathological/true reason/ErrorReconcilingNode roles/worker [k8s.ovn.org/node-chassis-id annotation not found for node ip-10-0-162-91.us-west-2.compute.internal, macAddress annotation not found for node "ip-10-0-162-91.us-west-2.compute.internal" , k8s.ovn.org/l3-gateway-config annotation not found for node "ip-10-0-162-91.us-west-2.compute.internal"] From: 17:47:15Z To: 17:47:16Z result=reject }
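The event text above packs several fields into one line. As a minimal sketch for triaging such output, the following Python pulls out the repeat count, node, reason, and the names of the missing annotations. The function name `summarize_event` and the regular expressions are illustrative assumptions based only on the sample line shown here, not on any documented event format.

```python
import re

# Sample copied from the failure output above (first event, timestamps omitted).
EVENT = (
    'event happened 21 times, something is wrong: '
    'node/ip-10-0-162-91.us-west-2.compute.internal hmsg/e277cb97cf - '
    'pathological/true reason/ErrorReconcilingNode roles/worker '
    '[k8s.ovn.org/node-chassis-id annotation not found for node '
    'ip-10-0-162-91.us-west-2.compute.internal, macAddress annotation not found '
    'for node "ip-10-0-162-91.us-west-2.compute.internal" , '
    'k8s.ovn.org/l3-gateway-config annotation not found for node '
    '"ip-10-0-162-91.us-west-2.compute.internal"]'
)


def summarize_event(line):
    """Extract key fields from one 'event happened N times' line (hypothetical helper)."""
    count = int(re.search(r"event happened (\d+) times", line).group(1))
    node = re.search(r"node/(\S+)", line).group(1)
    reason = re.search(r"reason/(\S+)", line).group(1)
    # Annotation names consist of word characters, dots, slashes, and hyphens.
    missing = re.findall(r"([\w./-]+) annotation not found", line)
    return {"count": count, "node": node, "reason": reason, "missing": missing}


summary = summarize_event(EVENT)
print(summary)
```

For the sample line, the three missing annotations are `k8s.ovn.org/node-chassis-id`, `macAddress`, and `k8s.ovn.org/l3-gateway-config`.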
      

      This particular failure comes from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-ingress-operator/901/pull-ci-openshift-cluster-ingress-operator-master-e2e-aws-ovn-serial/1638557665338593280. Search.ci has many more similar failures.

      Version-Release number of selected component (if applicable):

      I have seen this in 4.12, 4.13, and 4.14 CI jobs.

      How reproducible:

      Presently, search.ci shows the following stats for the past two days:

      Found in 0.25% of runs (1.49% of failures) across 44431 total runs and 4957 jobs (16.76% failed) in 321ms
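As a rough cross-check of these figures, the sketch below converts the percentages into approximate run counts. It assumes that "16.76% failed" refers to the 44431 total runs and that all percentages are rounded, so the results are estimates only.

```python
# Figures copied from the search.ci summary line above.
TOTAL_RUNS = 44431
FAILED_FRACTION = 0.1676      # assumed to be the share of total runs that failed
MATCH_OF_RUNS = 0.0025        # "Found in 0.25% of runs"
MATCH_OF_FAILURES = 0.0149    # "1.49% of failures"

failed_runs = TOTAL_RUNS * FAILED_FRACTION
matches_from_runs = TOTAL_RUNS * MATCH_OF_RUNS
matches_from_failures = failed_runs * MATCH_OF_FAILURES

print(f"approx. failed runs: {failed_runs:.0f}")
print(f"approx. matching runs (from % of runs): {matches_from_runs:.0f}")
print(f"approx. matching runs (from % of failures): {matches_from_failures:.0f}")
```

Under those assumptions, both estimates land at roughly 111 affected runs in the two-day window, so the two percentages are consistent with each other.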
      

      Steps to Reproduce

      1. Post a PR and have bad luck.
      2. Check search.ci: https://search.ci.openshift.org/?search=event+happened+%5Cd%2B+times%2C+something+is+wrong%3A+.*macAddress+annotation+not+found+for+node&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
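The search term in that URL is percent-encoded. As a minimal sketch, the Python below decodes the query string and tries the resulting pattern against a shortened copy of the event text from the description. It assumes the pattern behaves the same way under Python's `re` module as it does in search.ci, which holds for the simple constructs used here (`\d+` and `.*`) but is not verified against the service itself.

```python
import re
from urllib.parse import parse_qs, urlsplit

# URL copied from step 2 above.
URL = (
    "https://search.ci.openshift.org/?search=event+happened+%5Cd%2B+times%2C"
    "+something+is+wrong%3A+.*macAddress+annotation+not+found+for+node"
    "&maxAge=48h&context=1&type=junit&name=&excludeName=&maxMatches=5"
    "&maxBytes=20971520&groupBy=job"
)

# parse_qs turns '+' into spaces and resolves %XX escapes.
params = parse_qs(urlsplit(URL).query)
pattern = params["search"][0]
print(pattern)

# Shortened sample of the event text from the description.
SAMPLE = (
    "event happened 21 times, something is wrong: "
    "node/ip-10-0-162-91.us-west-2.compute.internal hmsg/e277cb97cf - "
    "pathological/true reason/ErrorReconcilingNode roles/worker "
    "[macAddress annotation not found for node "
    '"ip-10-0-162-91.us-west-2.compute.internal"]'
)

matched = re.search(pattern, SAMPLE) is not None
print("sample matches:", matched)
```

The remaining parameters show that the query covers the last 48 hours (`maxAge=48h`), searches JUnit output (`type=junit`), and groups results by job (`groupBy=job`).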

      Actual results

      CI fails.

      Expected results

      CI passes, or fails on some other test failure.

      Additional info:

      In the search.ci results, the failures all appear to be in jobs with "serial" or "etcd-scaling" in the names. The failing jobs include AWS, Azure, and GCP, and no other platforms. I only checked the past 2 days because search.ci failed to load with a longer time horizon.

            People

              mkennell@redhat.com Martin Kennelly
              mmasters1@redhat.com Miciah Masters
              Huiran Wang
