- Bug
- Resolution: Done
- Critical
- None
- 4.14
- No
- SDN Sprint 242
- 1
- Rejected
- False
Description of problem:
Payload 4.14.0-0.nightly-2023-09-04-224539 failed the periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade job due to multiple kubectl timeout issues.
Investigating periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-gcp-ovn-rt-upgrade/1698830727191203840 shows a large disruption event within the cluster, based on the host-to-pod and pod-to-pod monitoring, as observed in e2e-timelines_spyglass_20230905-011352.html
Version-Release number of selected component (if applicable):
4.14.0-0.nightly-2023-09-04-224539
How reproducible:
Multiple failures on payload 4.14.0-0.nightly-2023-09-04-224539
Steps to Reproduce:
1.
2.
3.
Actual results:
kubectl commands time out during the e2e phase, and the host-to-pod and pod-to-pod monitoring shows a large in-cluster disruption event.
Expected results:
We do not expect disruption or failures when invoking kubectl commands.
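For illustration only, here is a minimal sketch (not the CI harness or the actual disruption monitor) of polling the API server with kubectl during an upgrade run to surface these timeouts. It assumes a kubeconfig pointed at the cluster under test; the 5s request timeout and 10s poll interval are arbitrary choices.

```python
#!/usr/bin/env python3
"""Poll the API server with kubectl and log failures/timeouts.

Minimal sketch only; the timeout and interval values are assumptions.
"""
import subprocess
import time
from datetime import datetime, timezone


def poll_once():
    # --request-timeout bounds how long kubectl waits on the API server.
    result = subprocess.run(
        ["kubectl", "get", "nodes", "--request-timeout=5s"],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0, result.stderr.strip()


if __name__ == "__main__":
    # Run until interrupted (Ctrl+C); log only the failing polls.
    while True:
        ok, err = poll_once()
        if not ok:
            now = datetime.now(timezone.utc).isoformat()
            print(f"{now} kubectl failed: {err}")
        time.sleep(10)
```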
Additional info:
Frequency
I believe querying for high host-to-host-new-connections disruption is very good at finding these failures; the test failures themselves will be somewhat random, so I think this is our best bet.
This query shows that, with around 400 rt runs and 40 non-rt runs, both are experiencing a fail rate of 2-3%. However, for non-rt that amounts to only one hit (roughly 2.5% of 40 runs is a single run). I am going to slowly kick off a few more non-rt jobs there to try to get more data and reproduce more hits.
This dashboard can be used to find the specific prow jobs with over 120s of host-to-host disruption. Thus far everything I've opened exhibits the same "blood spatter" disruption pattern during the e2e phase of testing.
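As a rough illustration of the filtering described above, the sketch below scans a local export of per-job-run disruption records for host-to-host new-connections backends exceeding 120s. The file name and field names (backend, job_run_url, disruption_seconds) are assumptions for illustration, not the dashboard's real schema.

```python
#!/usr/bin/env python3
"""Find job runs whose host-to-host new-connections disruption exceeds 120s.

Sketch only: assumes disruption records were exported to disruption.json;
the record schema used here is hypothetical.
"""
import json

THRESHOLD_SECONDS = 120  # matches the >120s host-to-host filter mentioned above


def find_suspect_runs(path="disruption.json"):
    with open(path) as f:
        records = json.load(f)
    suspects = []
    for rec in records:
        backend = rec.get("backend", "")
        # Keep only host-to-host new-connection backends with large disruption.
        if "host-to-host" in backend and "new-connections" in backend:
            if rec.get("disruption_seconds", 0) > THRESHOLD_SECONDS:
                suspects.append(rec)
    return suspects


if __name__ == "__main__":
    for rec in find_suspect_runs():
        print(f'{rec["disruption_seconds"]:>6.0f}s  {rec["job_run_url"]}')
```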
- is related to OCPBUGS-19388: Observed increase in CPU usage during E2E testing with cgroupsv2 (Closed)