OCPBUGS-13316: kube-api disruption on Azure after pods updated

    Description

      There are two common patterns of Azure disruption for kube-apiserver; we need to be careful to focus on just one of them in this bug. The scenario we are chasing is:

      1. a blip of disruption to kube-api new connections (the same applies to openshift-api and oauth-api, but we focus on kube for simplicity; a probe sketch below illustrates the distinction)
      2. sometimes also affects reused connections
      3. occurs long after the kube-apiserver clusteroperator reports that all nodes have reached the new revision, roughly 20 minutes later
      4. occurs before node OS updates, if there are any (micro upgrades may not include them)
      5. usually accompanied by kube-apiserver or guard pods emitting pod readiness problems

      Note that we focus on kube-api here, but the disruption usually hits openshift-api and oauth-api. Ingress is not affected.
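
      To make the new-versus-reused connection distinction above concrete, here is a minimal sketch (not the actual origin disruption monitor) of a poller that hits the kube-apiserver /readyz endpoint two ways: one client dials a fresh TCP/TLS connection for every request, the other reuses a keep-alive connection. The APISERVER_URL and BEARER_TOKEN environment variables are assumptions for illustration only.

      // disruption_probe.go: illustrative sketch only, not the real monitor.
      package main

      import (
          "crypto/tls"
          "fmt"
          "net/http"
          "os"
          "time"
      )

      // probe issues a single GET against /readyz and logs anything that
      // would show up as a disruption blip on the intervals chart.
      func probe(client *http.Client, kind, base, token string) {
          req, err := http.NewRequest("GET", base+"/readyz", nil)
          if err != nil {
              fmt.Printf("%s [%s] request error: %v\n", time.Now().Format(time.RFC3339), kind, err)
              return
          }
          if token != "" {
              req.Header.Set("Authorization", "Bearer "+token)
          }
          resp, err := client.Do(req)
          if err != nil {
              fmt.Printf("%s [%s] DISRUPTION: %v\n", time.Now().Format(time.RFC3339), kind, err)
              return
          }
          defer resp.Body.Close()
          if resp.StatusCode != http.StatusOK {
              fmt.Printf("%s [%s] DISRUPTION: status %d\n", time.Now().Format(time.RFC3339), kind, resp.StatusCode)
          }
      }

      func main() {
          base := os.Getenv("APISERVER_URL") // assumption: e.g. https://api.<cluster>:6443
          token := os.Getenv("BEARER_TOKEN") // assumption: a token with access to /readyz

          tlsCfg := &tls.Config{InsecureSkipVerify: true} // sketch only

          // "new" connections: keep-alives disabled, so every poll dials a fresh TCP/TLS connection.
          newConn := &http.Client{
              Timeout:   5 * time.Second,
              Transport: &http.Transport{TLSClientConfig: tlsCfg, DisableKeepAlives: true},
          }
          // "reused" connections: keep-alives enabled, so polls share an established connection.
          reusedConn := &http.Client{
              Timeout:   5 * time.Second,
              Transport: &http.Transport{TLSClientConfig: tlsCfg},
          }

          for range time.Tick(time.Second) {
              probe(newConn, "new", base, token)
              probe(reusedConn, "reused", base, token)
          }
      }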

      Intervals charts are key to the initial investigation of these. I'm going to link directly to the new Sippy intervals pages (see the Intervals link under Debug Tools), which allow me to prepopulate the filters I'm using to isolate the problem. Removing these filters may help you see more; they are just a starting point.

      Sample job runs:

      Sample job 1 (intervals) (Loki logs): you can see the disruption at 3:53:34, accompanied by pod readiness problems below.

      Sample job 2 (intervals): Disruption at 12:54:45. No Loki logs.

      Sample job 3 (intervals): Disruption at 1:02:41 PM. No Loki logs.

       

      The Loki logs were recently re-enabled for this job (see Debug Tools on the prow job).
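
      If you want to pull the readiness-probe failures out of those Loki logs programmatically rather than through the UI, here is a minimal sketch using Loki's query_range HTTP API. LOKI_URL, the label names in the LogQL selector, and the lack of authentication are assumptions for illustration; the CI Loki instance may use different labels and require a token.

      // loki_readiness_query.go: illustrative sketch of querying Loki directly.
      package main

      import (
          "encoding/json"
          "fmt"
          "net/http"
          "net/url"
          "os"
          "time"
      )

      // Minimal slice of Loki's query_range response that we care about.
      type lokiResponse struct {
          Data struct {
              Result []struct {
                  Stream map[string]string `json:"stream"` // labels for the stream
                  Values [][]string        `json:"values"` // [timestamp, log line] pairs
              } `json:"result"`
          } `json:"data"`
      }

      func main() {
          base := os.Getenv("LOKI_URL") // assumption: base URL of the Loki instance

          // LogQL: lines from the kube-apiserver namespace mentioning readiness
          // probe failures over the last hour. The label names are assumptions.
          params := url.Values{}
          params.Set("query", `{namespace="openshift-kube-apiserver"} |= "Readiness probe failed"`)
          params.Set("start", fmt.Sprint(time.Now().Add(-time.Hour).UnixNano()))
          params.Set("end", fmt.Sprint(time.Now().UnixNano()))
          params.Set("limit", "100")

          resp, err := http.Get(base + "/loki/api/v1/query_range?" + params.Encode())
          if err != nil {
              panic(err)
          }
          defer resp.Body.Close()

          var lr lokiResponse
          if err := json.NewDecoder(resp.Body).Decode(&lr); err != nil {
              panic(err)
          }
          for _, stream := range lr.Data.Result {
              for _, v := range stream.Values {
                  if len(v) == 2 {
                      fmt.Println(stream.Stream["pod"], v[1]) // the "pod" label is an assumption
                  }
              }
          }
      }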

      This pattern is quite common: disruption prior to node reboots, well after the clusteroperator reports it is no longer Progressing, and pod readiness problems for kube-apiserver.

      Again, removing my regex may help reveal additional details about what's going on.

      Solving this issue will alleviate a large chunk of the disruption we see on Azure, which is an outlier in terms of the amount of disruption it experiences.

      More details coming.

       

      I think the problem appears on 4.14 Azure SDN micro upgrades as well:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-upgrade/1655237865774256128

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-upgrade/1655579993335402496

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-upgrade/1655533239072198656

      So this is not likely an OVN problem.

      It also seems to affect minor upgrades from 4.13:

       

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-azure-sdn-upgrade/1655904572033470464

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-azure-sdn-upgrade/1655904571219775488

       

      For reasons I cannot explain, this issue almost disappeared for about four days, between Apr 30 and May 3 inclusive. It is now back. The overall disruption graph shows the ebb and flow for this particular job, which has been quite erratic. That page also shows the most recent 500 runs of the job if you ever need additional samples.

       

       

      People

        Unassigned
        Devan Goodwin (rhn-engineering-dgoodwin)
        Ke Wang