OpenShift Bugs / OCPBUGS-37305

kube-apiserver operator revision rollout causing API disruption

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Undefined
    • Affects Version/s: 4.17.0
    • Component/s: Etcd
    • Quality / Stability / Reliability
    • Severity: Important
    • Release Blocker: Rejected
    • Sprint: ETCD Sprint 257, ETCD Sprint 258, ETCD Sprint 259

      TRT tooling has detected changes in disruption for most kube/openshift/oauth API backends. Data looked quite good (better than 4.16) up until around the end of June, but then something seems to have changed.

      There are a few patterns visible in the job runs, but this bug focuses on one that is fairly easy to find: around the time the kube-apiserver is rolling out a new revision, API endpoints take roughly 4-7s of disruption.

      Using this dashboard link, we had a rather consistent P95 of 0s until around June 22-23, when things began getting erratic.

      This bug is the result of scanning the job list lower on the dashboard and identifying patterns in the disruption.

      For this bug, the specific pattern we're looking at is 5-7s of disruption when the kube-apiserver is rolling out a new revision.
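
      To make that correlation easier to see when poking at a live cluster, something like the sketch below can log revision rollouts as they happen, so disruption windows can be lined up against them. This is only an illustration, not part of the TRT/origin tooling; the kubeconfig handling and the 10s poll interval are assumptions.

      // revision-watch: logs when any node's kube-apiserver currentRevision
      // lags targetRevision on the kubeapiserver/cluster operator resource.
      package main

      import (
          "context"
          "fmt"
          "time"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
          "k8s.io/apimachinery/pkg/runtime/schema"
          "k8s.io/client-go/dynamic"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          client, err := dynamic.NewForConfig(cfg)
          if err != nil {
              panic(err)
          }
          gvr := schema.GroupVersionResource{Group: "operator.openshift.io", Version: "v1", Resource: "kubeapiservers"}

          for range time.Tick(10 * time.Second) {
              obj, err := client.Resource(gvr).Get(context.TODO(), "cluster", metav1.GetOptions{})
              if err != nil {
                  fmt.Println("get kubeapiserver/cluster:", err)
                  continue
              }
              nodes, _, _ := unstructured.NestedSlice(obj.Object, "status", "nodeStatuses")
              for _, n := range nodes {
                  node, ok := n.(map[string]interface{})
                  if !ok {
                      continue
                  }
                  cur, _, _ := unstructured.NestedInt64(node, "currentRevision")
                  tgt, _, _ := unstructured.NestedInt64(node, "targetRevision")
                  if tgt != 0 && cur != tgt {
                      fmt.Printf("%s %v rolling out revision %d -> %d\n",
                          time.Now().Format(time.RFC3339), node["nodeName"], cur, tgt)
                  }
              }
          }
      }

      The nodeStatuses stanza is updated node by node as the static pod rollout progresses, so any disruption interval overlapping a currentRevision -> targetRevision transition matches the pattern described above.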

      Examples:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1813938460914880512

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1813847818029240320

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1813213432246177792

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1811039682226556928

      Open Debug Tools > Intervals or expand the Spyglass charts to see the disruption and its correlation to operator updates.

      The error always appears to be a net/http timeout awaiting response headers, and it affects the kube, openshift, and oauth APIs, on both new and reused connections.
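
      That error string is what Go's net/http Transport returns when its ResponseHeaderTimeout expires before the server sends any response headers, i.e. the connection is accepted but never answered in time. A minimal probe in that style is sketched below; the endpoint URL, 5s header timeout, and 1s poll interval are illustrative assumptions, not the values the CI monitors use.

      package main

      import (
          "crypto/tls"
          "fmt"
          "net/http"
          "time"
      )

      func main() {
          client := &http.Client{
              Transport: &http.Transport{
                  // Produces "net/http: timeout awaiting response headers" when it fires.
                  ResponseHeaderTimeout: 5 * time.Second,
                  TLSClientConfig:       &tls.Config{InsecureSkipVerify: true}, // throwaway probe only
              },
          }

          var outageStart time.Time
          for range time.Tick(time.Second) {
              resp, err := client.Get("https://api.example-cluster:6443/healthz") // hypothetical endpoint
              if err != nil {
                  if outageStart.IsZero() {
                      outageStart = time.Now()
                  }
                  fmt.Println("probe failed:", err)
                  continue
              }
              resp.Body.Close()
              if !outageStart.IsZero() {
                  fmt.Printf("disruption ended, lasted %s\n", time.Since(outageStart).Round(time.Second))
                  outageStart = time.Time{}
              }
          }
      }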

      It is fairly easy to find more examples, but watch out for the other two patterns. (I've seen several examples of disruption during node updates, and another bar earlier in the job runs with an unknown cause; these may become separate bugs.)

      Attachments:
        1. image-2024-08-02-14-19-03-571.png (101 kB, Thomas Jungblut)
        2. image-2024-08-02-14-18-53-795.png (81 kB, Thomas Jungblut)
        3. image-2024-08-02-14-01-33-811.png (304 kB, Thomas Jungblut)
        4. image-2024-08-02-13-59-06-529.png (132 kB, Thomas Jungblut)
        5. image-2024-08-02-13-35-19-481.png (224 kB, Thomas Jungblut)
        6. image-2024-08-02-13-17-32-016.png (126 kB, Thomas Jungblut)
        7. image-2024-08-02-13-11-09-710.png (144 kB, Thomas Jungblut)
        8. image-2024-08-02-13-10-10-268.png (140 kB, Thomas Jungblut)
        9. image-2024-08-02-13-07-36-806.png (94 kB, Thomas Jungblut)

              Assignee: Dean West (dwest@redhat.com)
              Reporter: Devan Goodwin (rhn-engineering-dgoodwin)