- Bug
- Resolution: Won't Do
- Normal
- None
- 4.14.0
- Important
- No
- Rejected
- False
Description of problem:
With the phase1->phase2 upgrade, we know there will be an outage for each node (around 400 ms) as routing is updated to move from the local zone to the remote zone. We can now see evidence of this outage with the new host->pod disruption tests that were added:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-upgrade/1693467197642379264

In this run, when the phase1->phase2 upgrade migrates master-0, there is an outage from 5:06:14 to 5:06:15 reported by clients master-2, worker-a, and worker-c.

master-0 migrates and completes:

  I0821 05:06:13.935325 200044 default_node_network_controller.go:946] Upgrade hack: ovnkube-node ci-op-8rs66hs2-915a5-jlkmv-master-0 finished setting DB Auth; took: 1.251122756s

master-2 notices the migration and updates its routes:

  I0821 05:06:13.915230 1 master.go:917] Node "ci-op-8rs66hs2-915a5-jlkmv-master-0" moved from the local zone global to a remote zone ci-op-8rs66hs2-915a5-jlkmv-master-0. Cleaning the node resources
  I0821 05:06:14.207955 1 zone_ic_handler.go:252] Creating Interconnect resources for node ci-op-8rs66hs2-915a5-jlkmv-master-0 took: 41.349581ms

Here the outage duration is what we expected, and a single curl fails from all clients. However, there is a chance to tighten this window even further. When master-2 identifies that master-0 is migrating, we do the following on master-2:

1. Clean up the remote node's "local/legacy" setup (gateway router, node switch, etc.), and also delete the SBDB chassis.
2. Re-create the SBDB chassis as remote.
3. Create the remote resources: static route, transit switch, port binding, etc.

All of these are done as separate OVSDB transactions. When we execute step 1, we break connectivity earlier than we need to. What we should do is bundle all of these into a single OVSDB transaction.
Additionally, there is no need to delete the SBDB chassis in step 1 when step 2 can simply update it as needed (instead of recreating it). Realistically, I think this change will only shave 0.1 to 0.2 seconds off the outage, but it is an easy change that makes things more atomic.