- Bug
- Resolution: Done-Errata
- Critical
- 4.16.z
Description of problem:
During the OpenShift Container Platform 4.16 upgrade, when the openshift-sdn pods roll out, the OVS tables are flushed, causing all ports for existing pods to be re-created.
The OVS table flush happens because of a flowVersion change that is required for work on the Limited Live Migration (similar changes were made in the past).
For some customers, this flush causes massive service disruption, impacting production services for multiple minutes until they recover and return to a fully functional state.
Such an impact in production is not acceptable and needs to be investigated to provide guidance on how the disruption can be reduced or minimized.
Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.16
How reproducible:
Random
Steps to Reproduce:
1. Upgrade OpenShift Container Platform 4 with OpenShiftSDN from OpenShift Container Platform 4.15 to 4.16
2. Observe applications for failing probes and how long they take to recover (probes should rely on other services)
Actual results:
Many services report probe failures for multiple minutes until they recover or are forcefully re-created.
Dec 04 13:26:22 worker-01 kubenswrapper[1980]: I1204 13:26:22.859373 1980 prober.go:107] "Probe failed" probeType="Liveness" pod="namespace/pod-abcde" podUID=4cbc08e5-16e3-491e-a1db-ce44ca0410c5 containerName="container" probeResult=failure output="Get \"http://10.1.1.25:5001/healthz\": dial tcp 10.1.1.25:5001: connect: no route to host"
Dec 04 13:26:22 worker-01 kubenswrapper[1980]: I1204 13:26:22.859378 1980 prober.go:107] "Probe failed" probeType="Readiness" pod="namespace/pod-abcde" podUID=4cbc08e5-16e3-491e-a1db-ce44ca0410c5 containerName="container" probeResult=failure output="Get \"http://10.1.1.25:5001/healthz\": dial tcp 10.1.1.25:5001: connect: no route to host"
Dec 04 13:26:25 worker-01 kubenswrapper[1980]: I1204 13:26:25.931213 1980 prober.go:107] "Probe failed" probeType="Liveness" pod="namespace/pod-abcde" podUID=4cbc08e5-16e3-491e-a1db-ce44ca0410c5 containerName="container" probeResult=failure output="Get \"http://10.1.1.25:5001/healthz\": dial tcp 10.1.1.25:5001: connect: no route to host"
Dec 04 13:26:25 worker-01 kubenswrapper[1980]: I1204 13:26:25.931265 1980 prober.go:107] "Probe failed" probeType="Readiness" pod="namespace/pod-abcde" podUID=4cbc08e5-16e3-491e-a1db-ce44ca0410c5 containerName="container" probeResult=failure output="Get \"http://10.1.1.25:5001/healthz\": dial tcp 10.1.1.25:5001: connect: no route to host"
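To quantify how long each pod was affected, the probe failures above can be summarized per pod. The following is a minimal sketch, not part of the original report: the function name probe_failure_windows and the year parameter are hypothetical, and the regular expression assumes the kubelet log layout shown above (klog header followed by "Probe failed" with probeType and pod fields). klog timestamps carry no year, so the default year is only an assumption; it does not affect durations within a single upgrade.

```python
import re
from datetime import datetime

# Matches the klog header (e.g. "I1204 13:26:22.859373 1980 prober.go:107]")
# and the probeType/pod fields of a kubelet "Probe failed" message.
PROBE_RE = re.compile(
    r'[IWE](?P<mmdd>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6})\s+\d+ prober\.go:\d+\] '
    r'"Probe failed" probeType="(?P<type>\w+)" pod="(?P<pod>[^"]+)"'
)


def probe_failure_windows(text, year=2024):
    """Return {pod: (first_failure, last_failure, count)} for probe failures.

    `year` is an assumption supplied by the caller because klog omits it.
    """
    windows = {}
    for m in PROBE_RE.finditer(text):
        ts = datetime.strptime(
            f"{year}{m['mmdd']} {m['time']}", "%Y%m%d %H:%M:%S.%f"
        )
        first, last, count = windows.get(m["pod"], (ts, ts, 0))
        windows[m["pod"]] = (min(first, ts), max(last, ts), count + 1)
    return windows
```

Feeding it the kubelet journal of a node (for example the output of journalctl -u kubelet saved to a file) gives, for each pod, the first and last failure timestamps and the number of failed probes. The difference between the two timestamps is a lower bound on the disruption window for that pod, since the outage may start before the first probe fires and end after the last failed one.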
Expected results:
No service disruption at all, or at most a disruption that goes unnoticed and recovers within seconds rather than minutes.
Additional info:
Affected Platforms:
The effect was seen on multiple large OpenShift Container Platform 4 clusters with 80+ nodes. The clusters run on Microsoft Azure and AWS and show the same effect.
- blocks: OCPBUGS-45955 Massive service disruption during OpenShift Container Platform 4.16 upgrade due to OVS table flush (Closed)
- depends on: OCPBUGS-45920 Massive service disruption during OpenShift Container Platform 4.16 upgrade due to OVS table flush (Closed)
- is caused by: OCPBUGS-25640 Clusters should not have ~1s HTTPS i/o timeout blips during updates (Closed)
- is cloned by: OCPBUGS-45955 Massive service disruption during OpenShift Container Platform 4.16 upgrade due to OVS table flush (Closed)
- links to: RHBA-2024:10973 OpenShift Container Platform 4.16.z bug fix update