- Bug
- Resolution: Won't Do
- Normal
- None
- 4.14.0
- Important
- No
- Rejected
- False
Description of problem:
With the phase1->phase2 upgrade, we know there will be an outage for each node (around 400 ms) as routing is updated to move from the local zone to the remote zone. We can now see evidence of this outage with the new host->pod disruption tests that were added:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-upgrade/1693467197642379264

In this run, when the phase1->phase2 upgrade migrates master-0, there is an outage from 5:06:14 to 5:06:15 reported by clients master-2, worker-a, and worker-c.

master-0 migrates and completes:

  I0821 05:06:13.935325 200044 default_node_network_controller.go:946] Upgrade hack: ovnkube-node ci-op-8rs66hs2-915a5-jlkmv-master-0 finished setting DB Auth; took: 1.251122756s

master-2 notices the migration and updates its routes:

  I0821 05:06:13.915230 1 master.go:917] Node "ci-op-8rs66hs2-915a5-jlkmv-master-0" moved from the local zone global to a remote zone ci-op-8rs66hs2-915a5-jlkmv-master-0. Cleaning the node resources
  I0821 05:06:14.207955 1 zone_ic_handler.go:252] Creating Interconnect resources for node ci-op-8rs66hs2-915a5-jlkmv-master-0 took: 41.349581ms

Here the outage duration is what we expected, and a single curl fails from all clients. However, there is a chance to tighten this window even further. When master-2 identifies that master-0 is migrating, we do the following on master-2:

1. Clean up the remote node's "local/legacy" setup (gateway router, node switch, etc.), and also delete the SBDB chassis.
2. Re-create the SBDB chassis as remote.
3. Create the remote resources: static route, transit switch, port binding, etc.

All of these are done as separate OVSDB transactions. When we execute step 1, we break connectivity earlier than we need to. What we should do is bundle all of these into a single OVSDB transaction.
Additionally, there is no need to delete the SBDB chassis in step 1 when step 2 can simply update it as needed (instead of recreating it). Realistically, I think this change will only shave 0.1 to 0.2 seconds off the outage, but it is an easy change that makes things more atomic.