OpenShift Bugs / OCPBUGS-9912

Ingress Takes 40s on Average Downtime During GCP OVN Upgrades


    • Severity: Moderate
    • Sprint: SDN Sprint 233
    • Rejected

      Cloning bug for 4.12 in order to improve disruption times in upgrades from 4.12 to 4.13. 

      -- 

      This is a long-standing issue where GCP OVN, for some reason, sees dramatically more disruption to ingress during upgrades than other clouds. It can best be seen in the "ingress" graphs in charts such as: https://lookerstudio.google.com/s/v6xhLCTHHDY

      Notice image-registry-new (which is ingress-backed), ingress-to-console new, and ingress-to-oauth new, all of which average around 40s of disruption as of the time of this writing. For comparison, Azure is normally <10s, and AWS <4s.

      You will also note the load-balancer new backend shows similarly high disruption, but after conversations with Network Edge we now know the code paths for these two are very different, so we're filing this as a separate bug. The service load balancer (SLB) bug is https://issues.redhat.com/browse/OCPBUGS-6796. The two may prove to have the same cause in the future, as they do appear similar, but they are not identical, even in terms of when the problems occur.

      Some example prow jobs are easy to find, as the disruption is there on average. Note that we do not typically fail a test on these, because the disruption monitoring stack is built to hold the line at where we're at now, and this is a long-standing issue.

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade/1620744632478470144

      This job was nearly successful but saw 45s of disruption to image-registry-new. The disruption observed can always be seen in artifacts such as: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade/1620744632478470144/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/artifacts/junit/backend-disruption_20230201-120923.json
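
      To compare runs more easily, a short script like the sketch below can total up the observed disruption per backend from one of these backend-disruption JSON artifacts. The schema is an assumption here (an "items" list whose entries carry "locator", "from", and "to" fields, like the other intervals artifacts these jobs produce); adjust the field names if the file you pull down differs.

        import json
        import sys
        from collections import defaultdict
        from datetime import datetime

        def parse_ts(ts):
            # Timestamps in these artifacts look RFC3339-ish, e.g. "2023-02-01T12:09:23Z".
            return datetime.fromisoformat(ts.replace("Z", "+00:00"))

        def summarize(path):
            with open(path) as f:
                data = json.load(f)
            totals = defaultdict(float)
            for item in data.get("items", []):
                start, end = item.get("from"), item.get("to")
                if not start or not end:
                    continue
                totals[item.get("locator", "unknown")] += (parse_ts(end) - parse_ts(start)).total_seconds()
            # Worst backends (e.g. image-registry-new) first.
            for locator, seconds in sorted(totals.items(), key=lambda kv: -kv[1]):
                print(f"{seconds:7.1f}s  {locator}")

        if __name__ == "__main__":
            summarize(sys.argv[1])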

      Expanding the first "Intervals - spyglass" chart on the main prowjob page, you can see when the disruption occurred and what else was going on in the cluster at that time.

      This shows we're not getting one continuous 40+s stretch of disruption, but rather a few shorter batches.

      The ingress services all go down at roughly the same time, while the service load balancer pattern looks a little different, hence the separate bug mentioned above.

      For more examples, just visit https://sippy.dptools.openshift.org/sippy-ng/jobs/4.13/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade%22%7D%5D%7D&sortField=timestamp&sort=desc; it will happen on nearly every run.
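
      The Sippy link above is just an encoded filter query; for anyone who wants to point the same search at a different job name, here is a rough sketch of how to rebuild it, with the parameter names lifted straight from the URL itself:

        import json
        from urllib.parse import urlencode

        def sippy_job_runs_url(job_name, release="4.13"):
            # Filter structure and query parameter names are taken from the URL above.
            filters = {"items": [{"columnField": "name",
                                  "operatorValue": "equals",
                                  "value": job_name}]}
            query = urlencode({"filters": json.dumps(filters, separators=(",", ":")),
                               "sortField": "timestamp",
                               "sort": "desc"})
            return f"https://sippy.dptools.openshift.org/sippy-ng/jobs/{release}/runs?{query}"

        print(sippy_job_runs_url(
            "periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade"))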

      When examining what else was going on when this happens, we see some clear patterns of nodes being updated.
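
      A quick way to sanity-check that correlation offline is to test whether each disruption window intersects a node-update window. The sketch below uses hypothetical timestamps standing in for the intervals you would pull out of the artifacts above:

        from datetime import datetime

        def overlaps(a_start, a_end, b_start, b_end):
            # Two time windows intersect if each starts before the other ends.
            return a_start < b_end and b_start < a_end

        # Hypothetical windows standing in for intervals pulled from the artifacts:
        # (start, end) of each ingress disruption batch and each node update.
        disruptions = [(datetime(2023, 2, 1, 12, 30, 0), datetime(2023, 2, 1, 12, 30, 45))]
        node_updates = [(datetime(2023, 2, 1, 12, 28, 0), datetime(2023, 2, 1, 12, 33, 0))]

        for d_start, d_end in disruptions:
            hit = any(overlaps(d_start, d_end, n_start, n_end) for n_start, n_end in node_updates)
            print(f"{d_start:%H:%M:%S}-{d_end:%H:%M:%S}: " + ("overlaps a node update" if hit else "no node update in window"))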

              Riccardo Ravaioli (rravaiol@redhat.com)
              Devan Goodwin (rhn-engineering-dgoodwin)
              Anurag Saxena
