Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 4.12.z
Affects Version/s: 4.13.0
Component/s: Networking / ovn-kubernetes
Labels:
- trt
- trt-regression

Severity:
Moderate
Regression:
No
Sprint:
SDN Sprint 234, SDN Sprint 235
sprint_count:
2
Release Blocker:
Rejected
Blocked:
False
Blocked Reason:

None
Target Version:

4.12.z

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This is a clone of issue ~~OCPBUGS-11458~~. The following is the description of the original issue:
—
This is a clone of issue ~~OCPBUGS-6947~~. The following is the description of the original issue:
—
This is a long-standing issue in which GCP OVN, for reasons not yet understood, sees dramatically more ingress disruption during upgrades than other clouds. It is most visible in the "ingress" graphs in charts such as: https://lookerstudio.google.com/s/v6xhLCTHHDY

Note image-registry-new (which is ingress-backed), ingress-to-console new, and ingress-to-oauth new, all of which average about 40s of disruption as of this writing. For comparison, Azure is normally <10s and AWS <4s.

The load-balancer new backend shows similarly high disruption, but after conversations with network edge we now know that the code paths for the two are very different, so we are filing this as a separate bug. The SLB bug is https://issues.redhat.com/browse/OCPBUGS-6796. The two may prove to share a cause in the future: they look similar, but they are not identical, even in terms of when the problems occur.

Example prow jobs are easy to find, since the disruption is present in the average run. Note that we do not typically fail a test on these, because the disruption monitoring stack is built to pin the level we are at now, and this is a long-standing issue.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade/1620744632478470144

This job was nearly successful but saw 45s of disruption to image-registry-new. The observed disruption can always be found in artifacts such as: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade/1620744632478470144/artifacts/e2e-gcp-ovn-upgrade/openshift-e2e-test/artifacts/junit/backend-disruption_20230201-120923.json

Expanding the first "Intervals - spyglass" chart on the main prowjob page, you can see when the disruption occurred and what else was going on in the cluster at that time.

This shows that we are not getting a continuous 40+s of disruption, but rather a few batches.
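To make the "batches" distinction concrete, here is a minimal Python sketch that groups individual outage intervals into batches and totals the disrupted time. The interval values are made up for illustration (they are not taken from the job above), and `group_into_batches` and the 5-second gap threshold are hypothetical choices, not part of the disruption monitoring stack.

```python
def group_into_batches(intervals, max_gap=5.0):
    """Merge (start, end) outage intervals, in seconds, that are separated
    by at most max_gap seconds into a single batch."""
    batches = []
    for start, end in sorted(intervals):
        if batches and start - batches[-1][1] <= max_gap:
            # Close enough to the previous batch: extend it.
            batches[-1][1] = max(batches[-1][1], end)
        else:
            # Far from anything seen so far: start a new batch.
            batches.append([start, end])
    return [tuple(batch) for batch in batches]


# Hypothetical outage intervals, as seconds since the start of the upgrade.
sample = [(0, 3), (4, 10), (300, 312), (313, 320), (900, 917)]

batches = group_into_batches(sample)
total_disrupted = sum(end - start for start, end in sample)

print(batches)          # [(0, 10), (300, 320), (900, 917)]
print(total_disrupted)  # 45
```

In this made-up sample the same 45s total is spread over three separate batches rather than one continuous outage, which is the shape described above.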

The ingress services all go down roughly together. The service load balancer pattern looks a little different, hence the separate bug mentioned above.

The disruption happens in nearly every run. For more examples, see: https://sippy.dptools.openshift.org/sippy-ng/jobs/4.13/runs?filters=%7B%22items%22%3A%5B%7B%22columnField%22%3A%22name%22%2C%22operatorValue%22%3A%22equals%22%2C%22value%22%3A%22periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade%22%7D%5D%7D&sortField=timestamp&sort=desc
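The Sippy link carries its job filter as URL-encoded JSON in the `filters` query parameter. The sketch below rebuilds that link from a job name, which may be handy for looking up the runs of a different job. The helper name `job_runs_url` is hypothetical, and the assumption that Sippy accepts the same filter shape for other job names is inferred only from the one URL in this report.

```python
import json
from urllib.parse import quote

# Base path copied from the link in this report.
SIPPY_RUNS = "https://sippy.dptools.openshift.org/sippy-ng/jobs/4.13/runs"


def job_runs_url(job_name):
    """Build a Sippy job-runs link filtered to one job name (hypothetical helper)."""
    filters = {
        "items": [
            {"columnField": "name", "operatorValue": "equals", "value": job_name}
        ]
    }
    # Compact JSON, then percent-encode every reserved character.
    encoded = quote(json.dumps(filters, separators=(",", ":")), safe="")
    return f"{SIPPY_RUNS}?filters={encoded}&sortField=timestamp&sort=desc"


url = job_runs_url(
    "periodic-ci-openshift-release-master-ci-4.13-upgrade-from-stable-4.12-e2e-gcp-ovn-upgrade"
)
print(url)
```

For the job name above, this reproduces the link given in this report character for character.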

When examining what else was going on when this happens, we see some clear patterns of nodes being updated.

clones

OCPBUGS-11458 Ingress Takes 40s on Average Downtime During GCP OVN Upgrades

Closed

is blocked by

OCPBUGS-11458 Ingress Takes 40s on Average Downtime During GCP OVN Upgrades

Closed

links to

openshift/ovn-kubernetes#1638: OCPBUGS-11701: CARRY: use "prefer local" for annotated services

Assignee: Riccardo Ravaioli

Reporter: OpenShift Prow Bot

QA Contact: Anurag Saxena

Votes: 0

Watchers: 5

Created: 2023/04/12 10:30 AM

Updated: 2024/02/15 3:31 PM

Resolved: 2023/04/24 10:29 AM
