OCPBUGS-69706

Metal and vSphere kube-api-new-connections Disruption Increase

    • In Progress
    • Bug Fix
      A problem with signal handling in the keepalived container caused failover to take longer than it should when nodes are restarted. This could disrupt access to API and ingress services. The signal handling was fixed so failover happens immediately and disruption is minimized.

      This is a clone of issue OCPBUGS-59925. The following is the description of the original issue:

      Description of problem

      Beginning around June 18th, disruption metrics for 'kube-api-new-connections' increased on Metal and vSphere in '4.20' during 'micro' upgrades. Setting 'lookback days' to '1' makes the pattern in the graph visible and points to the 06/18/25 date. In '4.19' the pattern is less clear on Metal but does appear to be present on vSphere. Many job runs in the window show disruption of 4s or more, and the percentage of runs with non-zero disruption appears to be around 85-90% throughout the window.

      The increased disruption values appear in a variety of jobs, including the following:

      1. periodic-ci-openshift-release-master-ci-4.20-e2e-vsphere-runc-upgrade (example)
      2. periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ipi-ovn-upgrade-runc (example)
      3. periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ipi-ovn-bm-upgrade (example)
      4. periodic-ci-openshift-release-master-nightly-4.20-e2e-metal-ipi-upgrade-ovn-ipv6 (example)
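
      The "percentage of non-zero disruption" and "4s and above" figures in the description could be reproduced from per-run data roughly as follows. This is a minimal sketch, not the tooling that produced the graphs: the JobRun record, its field names, the disruption_summary helper, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobRun:
    """One upgrade job run. Field names are illustrative, not a real schema."""
    job: str
    disruption_seconds: float


def disruption_summary(runs, threshold_seconds=4.0):
    """Return the percentage of runs with any disruption and with
    disruption at or above the threshold."""
    if not runs:
        return {"nonzero_pct": 0.0, "over_threshold_pct": 0.0}
    total = len(runs)
    nonzero = sum(1 for r in runs if r.disruption_seconds > 0)
    over = sum(1 for r in runs if r.disruption_seconds >= threshold_seconds)
    return {
        "nonzero_pct": 100.0 * nonzero / total,
        "over_threshold_pct": 100.0 * over / total,
    }


# Made-up sample data, only to show the calculation.
sample_runs = [
    JobRun("e2e-vsphere-runc-upgrade", 0.0),
    JobRun("e2e-vsphere-runc-upgrade", 5.0),
    JobRun("e2e-metal-ipi-ovn-upgrade-runc", 4.0),
    JobRun("e2e-metal-ipi-ovn-upgrade-runc", 2.0),
]
summary = disruption_summary(sample_runs)
```

      Grouping the runs by release ('4.19' vs '4.20') or platform before calling the helper would give the per-release comparison described above.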

      Additional Information

      • The disruption appears to coincide with alerts in the KubeletLog about failures to update lease.
      • These intervals show that a portion of the disruption correlates with the kube-apiserver shutdown. The disruption starts before the shutdown, but the shutdown may still be relevant.
      • Looking at PRs merged around 6/18, this one sticks out as potentially relevant (worth looking at, but I have no reason to believe it is the cause other than the timing).
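
      One way to check the correlation noted above between disruption intervals and kube-apiserver shutdown is to compute how much each disruption window overlaps a shutdown window and whether it began first. The sketch below assumes the intervals are available as (start, end) timestamp pairs; the function names and timestamps are hypothetical and are not taken from the CI interval data.

```python
from datetime import datetime, timedelta


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Seconds shared by two time intervals (0.0 if they do not overlap)."""
    latest_start = max(a_start, b_start)
    earliest_end = min(a_end, b_end)
    return max(0.0, (earliest_end - latest_start).total_seconds())


def correlate(disruptions, shutdowns):
    """For each disruption interval, report its total overlap with the
    shutdown intervals and whether it started before an overlapping one.
    Assumes the shutdown intervals do not overlap each other."""
    results = []
    for d_start, d_end in disruptions:
        overlapping = [
            (s, e) for s, e in shutdowns
            if overlap_seconds(d_start, d_end, s, e) > 0
        ]
        results.append({
            "start": d_start,
            "overlap_seconds": sum(
                overlap_seconds(d_start, d_end, s, e) for s, e in overlapping
            ),
            "started_before_shutdown": any(d_start < s for s, _ in overlapping),
        })
    return results


# Made-up timestamps for illustration only.
t0 = datetime(2025, 6, 18, 12, 0, 0)
disruptions = [
    (t0, t0 + timedelta(seconds=6)),
    (t0 + timedelta(seconds=60), t0 + timedelta(seconds=62)),
]
shutdowns = [(t0 + timedelta(seconds=2), t0 + timedelta(seconds=30))]
report = correlate(disruptions, shutdowns)
```

      In this made-up data the first disruption overlaps the shutdown window but begins two seconds earlier, which is the shape of the pattern described in the second bullet.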

              bnemec@redhat.com Benjamin Nemec
              sgoeddel@redhat.com Stephen Goeddel
              Ross Brattain