OpenShift Bugs / OCPBUGS-45955

Massive service disruption during OpenShift Container Platform 4.16 upgrade due to OVS table flush

      * Previously, when `openshift-sdn` pods were deployed during the {product-title} upgrading process, the Open vSwitch (OVS) storage table was cleared. This issue occurred on {product-title} {product-version}.19 and later versions. Ports for existing pods had to be re-created and this disrupted numerous services. With this release, a fix ensures that the OVS tables do not get cleared and pods do not get disconnected during a cluster upgrade operation. (link:https://issues.redhat.com/browse/OCPBUGS-45955[*OCPBUGS-45955*])
    • Bug Fix
    • Done

      This is a clone of issue OCPBUGS-45806. The following is the description of the original issue:

      Description of problem:
      During the OpenShift Container Platform 4.16 upgrade, when the openshift-sdn pods roll out, the OVS tables are flushed, causing all ports for existing pods to be re-created.

      The OVS table flush happens because of a flowVersion change that is required for work related to Limited Live Migration (similar changes were made in the past).

      For some customers, this flush apparently causes massive service disruption, impacting production services for multiple minutes until they recover and return to a fully functional state.

      Such an impact in production is not acceptable and needs to be investigated to provide guidance on how the disruption can be reduced or minimized.

      Version-Release number of selected component (if applicable):
      OpenShift Container Platform 4.16

      How reproducible:
      Random

      Steps to Reproduce:
      1. Upgrade an OpenShift Container Platform 4 cluster that uses OpenShiftSDN from 4.15 to 4.16.
      2. Observe applications for failing probes and how long they take to recover (the probes should depend on other services).
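To put a number on "how long they take to recover" in step 2, probe results collected during the upgrade can be collapsed into outage spans. The following is only a minimal sketch: the function name `outage_windows` and the `(timestamp, ok)` sample format are assumptions made for illustration and are not part of any OpenShift tooling.

```python
from datetime import datetime, timedelta


def outage_windows(samples):
    """Collapse time-ordered (timestamp, ok) samples into outage spans.

    A span starts at the first failed sample and ends at the next
    successful one. An outage still open at the end of the data is
    reported with end=None. The sample format is hypothetical.
    """
    windows = []
    start = None
    for ts, ok in samples:
        if not ok and start is None:
            start = ts                    # first failure opens a span
        elif ok and start is not None:
            windows.append((start, ts))   # first success closes it
            start = None
    if start is not None:
        windows.append((start, None))
    return windows


if __name__ == "__main__":
    t0 = datetime(2000, 1, 1)  # arbitrary base time for the demo
    results = [True, False, False, True, False, True]
    samples = [(t0 + timedelta(seconds=i), ok) for i, ok in enumerate(results)]
    for start, end in outage_windows(samples):
        print(start.time(), "->", end.time() if end else "ongoing")
```

Spans lasting a few seconds would match the expected behavior described in this report, while spans of multiple minutes would match the reported problem.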

      Actual results:
      Many services report probe failures for multiple minutes until they recover or are forcefully re-created.

      Dec 04 13:26:22 worker-01 kubenswrapper[1980]: I1204 13:26:22.859373    1980 prober.go:107] "Probe failed" probeType="Liveness" pod="namespace/pod-abcde" podUID=4cbc08e5-16e3-491e-a1db-ce44ca0410c5 containerName="container" probeResult=failure output="Get \"http://10.1.1.25:5001/healthz\": dial tcp 10.1.1.25:5001: connect: no route to host"
      Dec 04 13:26:22 worker-01 kubenswrapper[1980]: I1204 13:26:22.859378    1980 prober.go:107] "Probe failed" probeType="Readiness" pod="namespace/pod-abcde" podUID=4cbc08e5-16e3-491e-a1db-ce44ca0410c5 containerName="container" probeResult=failure output="Get \"http://10.1.1.25:5001/healthz\": dial tcp 10.1.1.25:5001: connect: no route to host"
      Dec 04 13:26:25 worker-01 kubenswrapper[1980]: I1204 13:26:25.931213    1980 prober.go:107] "Probe failed" probeType="Liveness" pod="namespace/pod-abcde" podUID=4cbc08e5-16e3-491e-a1db-ce44ca0410c5 containerName="container" probeResult=failure output="Get \"http://10.1.1.25:5001/healthz\": dial tcp 10.1.1.25:5001: connect: no route to host"
      Dec 04 13:26:25 worker-01 kubenswrapper[1980]: I1204 13:26:25.931265    1980 prober.go:107] "Probe failed" probeType="Readiness" pod="namespace/pod-abcde" podUID=4cbc08e5-16e3-491e-a1db-ce44ca0410c5 containerName="container" probeResult=failure output="Get \"http://10.1.1.25:5001/healthz\": dial tcp 10.1.1.25:5001: connect: no route to host"
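To estimate how long each pod was affected, log lines like the ones above can be grouped by pod and probe type. This is a hedged sketch, not an official tool: `failure_windows` is a hypothetical helper, the regular expression only covers the line shape shown in this report, and the `year` parameter is an assumption because the klog header carries no year.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches the klog header (for example "I1204 13:26:22.859373") followed by
# the prober fields, as they appear in the excerpt in this report.
LINE_RE = re.compile(
    r'[IWEF](?P<month>\d{2})(?P<day>\d{2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+)'
    r'.*"Probe failed" probeType="(?P<probe>\w+)" pod="(?P<pod>[^"]+)"'
)


def failure_windows(lines, year=2000):
    """Return {(pod, probe_type): (first_failure, last_failure, count)}.

    The year is not part of the log line; the default is an arbitrary
    placeholder and does not affect durations within a single year.
    """
    seen = defaultdict(list)
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a "Probe failed" line
        ts = datetime.strptime(
            f"{year}-{m['month']}-{m['day']} {m['time']}",
            "%Y-%m-%d %H:%M:%S.%f",
        )
        seen[(m["pod"], m["probe"])].append(ts)
    return {key: (min(v), max(v), len(v)) for key, v in seen.items()}


if __name__ == "__main__":
    import sys

    # Usage (assumed): pipe the node journal into the script on stdin.
    for (pod, probe), (first, last, count) in failure_windows(sys.stdin).items():
        print(f"{pod} {probe}: {count} failures over {last - first}")
```

The span between the first and last failure is only a lower bound on the disruption, since the probe may already have been failing before the first logged line or may keep failing after the last one.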
      

      Expected results:
      No service disruption at all, or at most a disruption that goes unnoticed and recovers within seconds rather than minutes.

      Additional info:

      Affected Platforms:
      The effect was seen on multiple large OpenShift Container Platform 4 clusters with more than 80 nodes. The clusters run on Microsoft Azure and AWS and show the same effect on both platforms.

              dwinship@redhat.com Dan Winship
              openshift-crt-jira-prow OpenShift Prow Bot
              Zhanqi Zhao Zhanqi Zhao