TRT-1258 (OCP Technical Release Team)

Investigate a P95 Disruption Regression for GCP OVN Minor Upgrades


    • Type: Story
    • Resolution: Done
    • Priority: Normal

      This regression appears to only affect these four backends:
      openshift-api-new-connections
      oauth-api-new-connections
      cache-openshift-api-new-connections
      cache-oauth-api-new-connections

      In an affected run, all four experience significant disruption (over 5 minutes), and no other backends show the same pattern.

      For reasons not yet understood, kube-api does not seem to be affected, nor do sdn jobs or micro upgrades.

      What do openshift-api and oauth-api have in common?

      The problem appears to have started on Sep 6. It is quite rare, only really visible at the daily 95th and 99th percentiles. The next lowest percentiles we check are the 75th and 50th, and they do not show this regression. When it happens, the pattern is fairly consistent.

      This graph shows the problem appearing on Sep 6; viewing the most recent runs, you can see the small number of times this bug has appeared since.
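A minimal sketch of why a rare event like this only shows up at the high percentiles. The data is synthetic and not taken from the real jobs: it assumes 100 runs of which 6 hit the long-disruption pattern (just over 5 minutes), and it uses a simple nearest-rank percentile, which may differ from how the disruption dashboards actually compute theirs.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Synthetic, hypothetical data: 94 normal runs with ~2s of disruption and
# 6 bad runs with 320s (over 5 minutes) of disruption.
disruption_seconds = [2] * 94 + [320] * 6

for pct in (50, 75, 95, 99):
    print(f"P{pct}: {percentile(disruption_seconds, pct)}s")
```

With these assumed numbers, P50 and P75 stay at the normal value while P95 and P99 jump to the bad value, which matches the shape of the regression described above.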

      Some sample runs:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1703804765776908288

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1704228601563451392

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1704074538590932992

      Unfortunately, I think I found two hits back in 4.13, but only two.

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade/1704252507548553216

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-rt-upgrade/1702932973814288384

      Each affected job shows up once per affected backend, so filtering down to just one backend eliminates the duplicates. That leaves seven 4.14 jobs that have experienced this since Sep 6:
      Sep 6 - 3 jobs
      Sep 17 - 1 job
      Sep 19 - 2 jobs
      Sep 20 - 1 job
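The de-duplication above can be sketched as follows. This is an illustration only: the row layout `(day, job_id, backend)` and the job IDs are hypothetical placeholders, not the real query or the real Prow run IDs. Only the per-day counts and the backend names come from this ticket.

```python
BACKENDS = [
    "openshift-api-new-connections",
    "oauth-api-new-connections",
    "cache-openshift-api-new-connections",
    "cache-oauth-api-new-connections",
]


def count_jobs_per_day(rows, backend=BACKENDS[0]):
    """Count distinct job runs per day, looking at a single backend to avoid duplicates."""
    jobs_by_day = {}
    for day, job_id, row_backend in rows:
        if row_backend == backend:
            jobs_by_day.setdefault(day, set()).add(job_id)
    return {day: len(jobs) for day, jobs in jobs_by_day.items()}


# Hypothetical rows: every affected job appears once per affected backend.
observed = {"Sep 6": 3, "Sep 17": 1, "Sep 19": 2, "Sep 20": 1}
rows = [
    (day, f"{day}-job-{n}", backend)  # placeholder job IDs
    for day, count in observed.items()
    for n in range(count)
    for backend in BACKENDS
]

per_day = count_jobs_per_day(rows)
print(per_day, "total:", sum(per_day.values()))
```

Without the backend filter, the same rows would count each job four times, once for every affected backend.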

      A failed payload from Sep 6 gives insight into the changes that were landing around that time: https://sippy.dptools.openshift.org/sippy-ng/release/4.14/tags/4.14.0-0.nightly-2023-09-06-235710/pull_requests

      A number of the 33 PRs in that payload are OVN-related changes.

              rhn-engineering-dgoodwin Devan Goodwin