OpenShift Bugs: OCPBUGS-49817

APIServer disruption regression during network operator upgrade

    • Important
    • Yes
    • CORENET Sprint 269
    • 1
    • Proposed
    • False

      TRT has detected an apparent disruption regression in 4.19 micro upgrades on AWS. I'm not certain how widespread it is; there's a lot going on, but it's quite visible here.

      The problem looks to have begun around Jan 17th. Prior to this, the AWS micro P95 was consistently 0; since then it has been jumping as high as 6s, depending on the day and the lookback used. It seems to impact about 1 in 20 jobs, so we see it around the 95th percentile and above. It's less clear below that.
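      To illustrate why a problem hitting roughly 1 in 20 jobs makes P95 swing with the day and lookback, here is a minimal sketch. It assumes a nearest-rank percentile and made-up per-job disruption values; it is not how TRT actually computes P95, and `percentile_nearest_rank` is a hypothetical helper.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding float rounding.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical lookback windows of 20 job runs each (seconds of disruption).
# Window A: exactly one affected run out of 20.
window_a = [0] * 19 + [6]
# Window B: two affected runs out of 20.
window_b = [0] * 18 + [4, 6]

p95_a = percentile_nearest_rank(window_a, 95)
p95_b = percentile_nearest_rank(window_b, 95)
print(p95_a, p95_b)
```

      With these invented numbers, one extra affected run in the window is enough to move P95 off zero, which would match the signal being visible at P95 and above but unclear below it.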

      We do not see the same problem in 4.18 at this time.

      For this specific bug, the pattern I'm seeing is a band of disruption to new connections for all apiservers, during the network operator progressing phase of an upgrade, prior to rolling out node updates.
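      One way to check a run for this pattern is to measure how much of the disruption falls inside the window where the network operator is progressing. The sketch below assumes the window and the disruption intervals have already been extracted from the job's interval data; the timestamps and the `overlap_seconds` helper are hypothetical, and it assumes the disruption intervals do not overlap each other.

```python
from datetime import datetime, timedelta


def overlap_seconds(window, intervals):
    """Total seconds of the given intervals that fall inside window."""
    w_start, w_end = window
    total = timedelta(0)
    for start, end in intervals:
        lo = max(start, w_start)
        hi = min(end, w_end)
        if hi > lo:  # skip intervals entirely outside the window
            total += hi - lo
    return total.total_seconds()


# Hypothetical data for one job run.
day = datetime(2025, 1, 20)
progressing = (day.replace(hour=10, minute=0), day.replace(hour=10, minute=5))
disruptions = [
    # fully inside the progressing window (3s)
    (day.replace(hour=10, minute=1), day.replace(hour=10, minute=1, second=3)),
    # straddles the end of the window (2s inside)
    (day.replace(hour=10, minute=4, second=58), day.replace(hour=10, minute=5, second=2)),
    # well after the window (0s inside)
    (day.replace(hour=10, minute=20), day.replace(hour=10, minute=20, second=1)),
]

inside = overlap_seconds(progressing, disruptions)
print(inside)
```

      A run where most of the total disruption lands inside the progressing window would fit this bug; disruption elsewhere likely belongs to one of the other patterns.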

      Sample job runs:
      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-upgrade/1886515783572393984
      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-upgrade/1886126427712000000
      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-upgrade/1884621809362407424
      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-upgrade/1883866043437289472
      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-upgrade/1882431829286326272
      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.19-e2e-aws-ovn-upgrade/1882431831811297280

      These runs show clear bands of API disruption while the network operator is progressing. I don't think this was happening prior to the 17th of Jan.

      Examining job runs, there are a few patterns but the most

              jluhrsen Jamo Luhrsen
              rhn-engineering-dgoodwin Devan Goodwin
              Anurag Saxena Anurag Saxena