OpenShift Bugs / OCPBUGS-18549

Significant 12 minute pod-to-host disruption detected on aws ovn minor upgrades


Details

    • Important
    • No
    • Approved
    • False
    • Cause: A subnet mask was added to the values of the "k8s.ovn.org/host-addresses" annotation.

      Consequence: As nodes updated, they would rewrite the "k8s.ovn.org/host-addresses" annotation and remove the subnet mask, which caused issues with the master pods until they were also updated.

      Fix: An additional annotation, "k8s.ovn.org/host-cidrs", was created. During the upgrade the nodes leave the "k8s.ovn.org/host-addresses" annotation alone, and once the masters are fully upgraded they recognize the new annotation as including the subnet mask (see the comparison sketch after these fields).

      Result: Insignificant disruption during upgrades.
    • Bug Fix
    • In Progress
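
      For reference, below is a minimal sketch (not taken from the ovn-kubernetes code) that lists both annotations side by side for every node, so the pre-fix and post-fix formats can be compared during an upgrade. The annotation keys are the ones named in the doc text above; the client-go usage and the example values in the comments are illustrative assumptions.

      // Minimal sketch: print the old and new OVN host-address annotations for
      // every node. Assumes a kubeconfig at the default location.
      package main

      import (
          "context"
          "fmt"
          "path/filepath"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/client-go/kubernetes"
          "k8s.io/client-go/tools/clientcmd"
          "k8s.io/client-go/util/homedir"
      )

      func main() {
          kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
          config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
          if err != nil {
              panic(err)
          }
          client, err := kubernetes.NewForConfig(config)
          if err != nil {
              panic(err)
          }

          nodes, err := client.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
          if err != nil {
              panic(err)
          }
          for _, node := range nodes.Items {
              // Pre-fix format: bare IPs, e.g. ["10.0.0.5"] (illustrative value)
              hostAddresses := node.Annotations["k8s.ovn.org/host-addresses"]
              // Post-fix format: IPs with subnet masks, e.g. ["10.0.0.5/24"] (illustrative value)
              hostCIDRs := node.Annotations["k8s.ovn.org/host-cidrs"]
              fmt.Printf("%s\n  host-addresses: %s\n  host-cidrs:     %s\n",
                  node.Name, hostAddresses, hostCIDRs)
          }
      }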

    Description

      DISCLAIMER: The code for measuring in-cluster disruption is extremely new, so we cannot be 100% confident that what we're seeing is real. However, the bug below shows a problem that occurs only in one very specific configuration, with all others unaffected, which gives us some confidence that the signal is real.

      https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?orgId=1&var-platform=aws&var-percentile=P50&var-backend=pod-to-host-new-connections&var-releases=4.14&var-upgrade_type=minor&var-networks=sdn&var-networks=ovn&var-topologies=ha&var-architectures=amd64&var-min_job_runs=10&var-lookback=1&var-master_node_updated=Y&from=now-7d&to=now

      • affects pod-to-host-new-connections
      • affects aws minor upgrades, which are seeing over 14000s of disruption at the P50
      • does not affect pod-to-host-reused-connections
      • does not affect any other clouds
      • does not affect micro upgrades
      • does not affect pod-to-service or pod-to-pod backends
      • does not affect sdn

      The total disruption is summed across a number of pods, so the actual duration is roughly the total divided by 14. The actual disruption appears to be about 12 minutes and hits all pods doing pod-to-host monitoring simultaneously.
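
      As a back-of-the-envelope check (illustrative only, not part of any tooling), the sketch below divides the sample-job disruption window back out against the ~14 pod-to-host monitors; the timestamps come from the sample job referenced below, and the pod count is the rough figure stated above.

      package main

      import (
          "fmt"
          "time"
      )

      func main() {
          const monitoringPods = 14 // rough number of pods doing pod-to-host monitoring

          // Disruption window observed in the sample job: 7:28:19 - 7:40:03.
          start, _ := time.Parse("15:04:05", "07:28:19")
          end, _ := time.Parse("15:04:05", "07:40:03")
          perPod := end.Sub(start)

          aggregate := perPod * monitoringPods
          fmt.Printf("per-pod outage:   %v\n", perPod) // ~11m44s, i.e. about 12 minutes
          fmt.Printf("summed over pods: %.0fs\n", aggregate.Seconds())
          // ~9,856s, the same order of magnitude as the >14,000s P50 aggregate on the dashboard.
      }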

      Sample job (taken from expanding the "Most Recent Runs" panel in Grafana):

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade/1698740856976052224

      In the first spyglass chart for the upgrade, you can see the batch of disruption from 7:28:19 to 7:40:03.

      We do not have data from before OVN interconnect landed, so we cannot say whether this started at that time.


    People

      jtanenba@redhat.com Jacob Tanenbaum
      rhn-engineering-dgoodwin Devan Goodwin
      Anurag Saxena Anurag Saxena
      Votes: 0
      Watchers: 12
