Type: Bug
Resolution: Done
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.14.0
Component/s: Multi-Arch
Labels:
- disruption
- trt-standup

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:
None
Story Points:
None
Severity:
Important
Regression:
No

Target Backport Versions:
None
Target Version:
None
Release Blocker:
Proposed
Sprint:
None

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?orgId=1&from=now-90d&to=now&var-platform=aws&var-percentile=P50&var-backend=ingress-to-console-new-connections&var-backend=service-load-balancer-with-pdb-new-connections&var-releases=4.14&var-upgrade_type=micro&var-networks=ovn&var-topologies=ha&var-architectures=amd64&var-min_job_runs=1&var-lookback=1&var-master_node_updated=Y

In the above graph we can see that shortly after Aug 11, disruption spiked severely for both new and reused connections to all ingress-related backends.

Expanding the Most Recent Job Runs panel on the above link shows that all the bad results come from periodic-ci-openshift-multiarch-master-nightly-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade. This job sees 100-600s of disruption, whereas the normal non-multi-arch job typically sees 0-1s.
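For anyone reproducing the view, the dashboard link above carries its filters as `var-*` query parameters. The following is a minimal sketch, using only the Python standard library, that lists those filters; the function name `dashboard_filters` is made up for illustration and is not part of any Grafana or CI tooling.

```python
from urllib.parse import urlsplit, parse_qs

# The dashboard URL exactly as given in this report.
DASHBOARD_URL = (
    "https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption"
    "?orgId=1&from=now-90d&to=now&var-platform=aws&var-percentile=P50"
    "&var-backend=ingress-to-console-new-connections"
    "&var-backend=service-load-balancer-with-pdb-new-connections"
    "&var-releases=4.14&var-upgrade_type=micro&var-networks=ovn"
    "&var-topologies=ha&var-architectures=amd64&var-min_job_runs=1"
    "&var-lookback=1&var-master_node_updated=Y"
)


def dashboard_filters(url):
    """Return the var-* query parameters as {name: [values]} (hypothetical helper)."""
    query = parse_qs(urlsplit(url).query)
    prefix = "var-"
    # Keep only dashboard variables; a repeated key (e.g. var-backend) yields several values.
    return {
        key[len(prefix):]: values
        for key, values in query.items()
        if key.startswith(prefix)
    }


if __name__ == "__main__":
    for name, values in dashboard_filters(DASHBOARD_URL).items():
        print(f"{name}: {', '.join(values)}")
```

Running it prints one line per filter, which makes it easier to see at a glance which platform, release, upgrade type, and backends the graph is restricted to.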

Two sample jobs:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade/1698740856976052224

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-multiarch-master-nightly-4.14-ocp-e2e-aws-ovn-heterogeneous-upgrade/1698753834618195968

Expanding the spyglass chart shows that the disruption happens at roughly the same time for all backends and lasts for minutes.

Using a spyglass search string of "disrupt|OVN|ovn", it looks possible that OVN is struggling at this time; there are alerts around the same period. Curiously, only ingress-related backends show disruption; the apiservers all seem OK.
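To make the search string concrete: "disrupt|OVN|ovn" is a regular-expression alternation, and it is case-sensitive as written. The sketch below applies the same pattern with Python's `re` module. The sample lines are invented for illustration and are not taken from the job runs above, and `matching_lines` is a hypothetical helper, not a spyglass API.

```python
import re

# The search string from this report, compiled as a case-sensitive regex.
SEARCH = re.compile(r"disrupt|OVN|ovn")

# Hypothetical interval messages, made up for this example only.
SAMPLE_LINES = [
    "ingress-to-console-new-connections started disrupting",
    "alert ExampleOVNAlert firing (hypothetical alert name)",
    "pod ovnkube-node-abcde restarted",
    "kube-apiserver-new-connections is healthy",
    "Ovn written in mixed case is not matched",
]


def matching_lines(lines, pattern=SEARCH):
    """Return the lines containing at least one match of the pattern."""
    return [line for line in lines if pattern.search(line)]


if __name__ == "__main__":
    for line in matching_lines(SAMPLE_LINES):
        print(line)
```

The first three sample lines match and the last two do not. Any mixed-case spelling such as "Ovn" would need its own alternative or a case-insensitive match.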

Assignee: Jeff Young

Reporter: Devan Goodwin

Need Info From: None

Contributors: None

QA Contact: Doug Slavens (Inactive)

Doc Contact: None

Votes: 0

Watchers: 8

Created: 2023/09/05 2:19 PM

Updated: 2025/07/25 5:36 PM

Resolved: 2023/09/13 11:31 AM
