OpenShift Bugs / OCPBUGS-45806

Massive service disruption during OpenShift Container Platform 4.16 upgrade due to OVS table flush


      * Previously, when `openshift-sdn` pods were deployed during the {product-title} upgrading process, the Open vSwitch (OVS) storage table was cleared. This issue occurred on {product-title} {product-version}.19 and later versions. Ports for existing pods had to be re-created and this caused disruption to numerous services. With this release, a fix ensures that the OVS tables do not get cleared and pods do not get disconnected during a cluster upgrade operation. (link:https://issues.redhat.com/browse/OCPBUGS-45806[*OCPBUGS-45806*])
    • Bug Fix
    • Done

      Description of problem:
During the OpenShift Container Platform 4.16 upgrade, when openshift-sdn pods are rolling out, the OVS tables are flushed, causing all ports for existing pods to be re-created.

      The OVS table flush is happening because of a flowVersion change that is required for some effort around the Limited Live Migration (similar changes were done in the past).

For some customers, this flush causes massive service disruption, impacting production services for multiple minutes until they recover and return to a fully functional state.

Such an impact in production is not acceptable and needs to be investigated in order to provide guidance on how the disruption can be reduced or minimized.

      Version-Release number of selected component (if applicable):
      OpenShift Container Platform 4.16

      How reproducible:
      Random

      Steps to Reproduce:
      1. Upgrade OpenShift Container Platform 4 with OpenShiftSDN from OpenShift Container Platform 4.15 to 4.16
2. Observe applications for failing probes and how long they take to recover (probes should rely on other services)
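One way to quantify step 2 is to save the kubelet journal on an affected node (for example, `journalctl -u kubelet > kubelet.log`) and count probe failures per pod. This is a minimal sketch; the file name and helper function are illustrative, not part of any product tooling:

```shell
# Summarize kubelet probe failures per pod from a saved journal excerpt.
# Assumes the journal was saved first:  journalctl -u kubelet > kubelet.log
summarize_probe_failures() {
  grep '"Probe failed"' "$1" | grep -o 'pod="[^"]*"' | sort | uniq -c | sort -rn
}
```

Pods with the highest failure counts during the openshift-sdn rollout are the ones to inspect first.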

      Actual results:
Many services are reporting probe failures for multiple minutes until they recover or are forcefully re-created.

      Dec 04 13:26:22 worker-01 kubenswrapper[1980]: I1204 13:26:22.859373    1980 prober.go:107] "Probe failed" probeType="Liveness" pod="namespace/pod-abcde" podUID=4cbc08e5-16e3-491e-a1db-ce44ca0410c5 containerName="container" probeResult=failure output="Get \"http://10.1.1.25:5001/healthz\": dial tcp 10.1.1.25:5001: connect: no route to host"
      Dec 04 13:26:22 worker-01 kubenswrapper[1980]: I1204 13:26:22.859378    1980 prober.go:107] "Probe failed" probeType="Readiness" pod="namespace/pod-abcde" podUID=4cbc08e5-16e3-491e-a1db-ce44ca0410c5 containerName="container" probeResult=failure output="Get \"http://10.1.1.25:5001/healthz\": dial tcp 10.1.1.25:5001: connect: no route to host"
      Dec 04 13:26:25 worker-01 kubenswrapper[1980]: I1204 13:26:25.931213    1980 prober.go:107] "Probe failed" probeType="Liveness" pod="namespace/pod-abcde" podUID=4cbc08e5-16e3-491e-a1db-ce44ca0410c5 containerName="container" probeResult=failure output="Get \"http://10.1.1.25:5001/healthz\": dial tcp 10.1.1.25:5001: connect: no route to host"
      Dec 04 13:26:25 worker-01 kubenswrapper[1980]: I1204 13:26:25.931265    1980 prober.go:107] "Probe failed" probeType="Readiness" pod="namespace/pod-abcde" podUID=4cbc08e5-16e3-491e-a1db-ce44ca0410c5 containerName="container" probeResult=failure output="Get \"http://10.1.1.25:5001/healthz\": dial tcp 10.1.1.25:5001: connect: no route to host"
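The duration of the disruption can be estimated from kubelet log lines like the ones above by taking, per pod, the span between the first and last probe failure. A minimal sketch, assuming the journald timestamp format shown in the excerpt (the year is not present in the log, so it is passed in):

```python
import re
from datetime import datetime

# Matches journald-style kubelet probe-failure lines as in the excerpt above.
LINE_RE = re.compile(
    r'^(?P<ts>\w{3} \d{2} \d{2}:\d{2}:\d{2}) .*"Probe failed".*pod="(?P<pod>[^"]+)"'
)

def probe_failure_windows(lines, year=2024):
    """Return {pod: (first_failure, last_failure)} from kubelet journal lines."""
    windows = {}
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        first, last = windows.get(m['pod'], (ts, ts))
        windows[m['pod']] = (min(first, ts), max(last, ts))
    return windows
```

Feeding the output of `journalctl -u kubelet` through this gives a per-pod lower bound on the outage window, which makes it easier to compare disruption across nodes and upgrade attempts.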
      

      Expected results:
No service disruption at all; at minimum, the rollout should go largely unnoticed, with recovery within seconds rather than minutes.

      Additional info:

      Affected Platforms:
The effect was seen on multiple large OpenShift Container Platform 4 clusters with more than 80 nodes each. The affected clusters run on both Microsoft Azure and AWS and show the same behavior.

        dwinship@redhat.com Dan Winship
        rhn-support-sreber Simon Reber
        Zhanqi Zhao
        Votes: 0
        Watchers: 9
