OCPBUGS-18546

Multi-arch AWS OVN Micro Upgrade Jobs Experiencing Severe Ingress Disruption


    • Important
    • No
    • Proposed
    • False

      https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?orgId=1&from=now-90d&to=now&var-platform=aws&var-percentile=P50&var-backend=ingress-to-console-new-connections&var-backend=service-load-balancer-with-pdb-new-connections&var-releases=4.14&var-upgrade_type=micro&var-networks=ovn&var-topologies=ha&var-architectures=amd64&var-min_job_runs=1&var-lookback=1&var-master_node_updated=Y
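The filters encoded in the dashboard link can be hard to read inline. A minimal sketch that decodes them, assuming the usual Grafana convention that `var-` query parameters carry dashboard template variables (the helper name `dashboard_filters` is hypothetical):

```python
from urllib.parse import urlsplit, parse_qs

# The dashboard link from this report, split across lines for readability.
DASHBOARD_URL = (
    "https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption"
    "?orgId=1&from=now-90d&to=now&var-platform=aws&var-percentile=P50"
    "&var-backend=ingress-to-console-new-connections"
    "&var-backend=service-load-balancer-with-pdb-new-connections"
    "&var-releases=4.14&var-upgrade_type=micro&var-networks=ovn"
    "&var-topologies=ha&var-architectures=amd64&var-min_job_runs=1"
    "&var-lookback=1&var-master_node_updated=Y"
)


def dashboard_filters(url):
    """Return the 'var-*' query parameters as {name: [values]}."""
    query = parse_qs(urlsplit(url).query)
    return {
        key[len("var-"):]: values
        for key, values in query.items()
        if key.startswith("var-")
    }


filters = dashboard_filters(DASHBOARD_URL)
for name, values in filters.items():
    print(f"{name}: {', '.join(values)}")
```

Repeated parameters such as `var-backend` are collected into one list, so both selected backends show up together.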

      The graph above shows that shortly after Aug 11, disruption spiked severely for both new and reused connections across all ingress-related backends.

      Expanding the Most Recent Job Runs panel on the above link shows that all the bad results are coming from periodic-ci-openshift-multiarch-master-nightly-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade. This job is seeing 100-600s of disruption, whereas the normal non-multi-arch job is typically 0-1s.

      Two sample jobs:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade/1698740856976052224

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade/1698753834618195968

      Expanding the spyglass chart, we see the disruption happens at roughly the same time for all backends and lasts minutes.

      Filtering the spyglass chart with the search string "disrupt|OVN|ovn" suggests that OVN may be struggling during this window: there are alerts firing around the same time. Curiously, only ingress-related backends show disruption; the apiservers all seem OK.
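For readers unfamiliar with the search string, a rough sketch of what it selects, assuming the chart filter treats it as a case-sensitive regular expression (which would explain listing both "OVN" and "ovn"). The row names below are hypothetical placeholders, not output from the sample jobs:

```python
import re

# The search string from this report: three alternatives joined by "|".
SEARCH = re.compile(r"disrupt|OVN|ovn")

# Hypothetical row labels, made up purely for illustration.
rows = [
    "example-ingress-backend disruption started",
    "alert/ExampleOVNAlert firing",
    "ns/example-ovn-namespace pod/example restarted",
    "example-apiserver-backend available",
]

# Keep every row in which any of the three alternatives appears.
matching_rows = [row for row in rows if SEARCH.search(row)]
for row in matching_rows:
    print(row)
```

Only the last placeholder row is filtered out, since it contains none of the three alternatives.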

              jeffdyoung Jeff Young
              rhn-engineering-dgoodwin Devan Goodwin
              Doug Slavens Doug Slavens