OpenShift Bugs / OCPBUGS-37305

kube-apiserver operator revision rollout causing API disruption

    • Type: Bug
    • Resolution: Won't Do
    • Priority: Undefined
    • Affects Version/s: 4.17.0
    • Component/s: Etcd
    • Quality / Stability / Reliability
    • Severity: Important
    • Release Blocker: Rejected
    • Sprint: ETCD Sprint 257, ETCD Sprint 258, ETCD Sprint 259

      TRT tooling has detected changes in disruption for most kube/openshift/oauth API backends. Data looked quite good (better than 4.16) up until around the end of June, but then something seems to have changed.

      There are a few patterns visible in the job runs, but this bug focuses on one that is fairly easy to find: around the time the kube-apiserver is rolling out a new revision, API endpoints take roughly 4-7s of disruption.

      Using this dashboard link, we had a rather consistent P95 of 0s until around June 22-23, when things began getting erratic.

      This bug is the result of scanning the job list lower on the dashboard and identifying patterns in the disruption.

      For this bug, the specific pattern we're looking at is 5-7s of disruption when the kube-apiserver is rolling out a new revision.
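
      To make that correlation easier to see when poking at a live cluster, something like the sketch below can log revision rollouts as they happen, so disruption windows can be lined up against them. This is only an illustration, not part of the TRT/origin tooling; the kubeconfig handling and the 10s poll interval are assumptions.

      // revision-watch: logs when any node's kube-apiserver currentRevision
      // lags targetRevision on the kubeapiserver/cluster operator resource.
      package main

      import (
          "context"
          "fmt"
          "time"

          metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
          "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
          "k8s.io/apimachinery/pkg/runtime/schema"
          "k8s.io/client-go/dynamic"
          "k8s.io/client-go/tools/clientcmd"
      )

      func main() {
          cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
          if err != nil {
              panic(err)
          }
          client, err := dynamic.NewForConfig(cfg)
          if err != nil {
              panic(err)
          }
          gvr := schema.GroupVersionResource{Group: "operator.openshift.io", Version: "v1", Resource: "kubeapiservers"}

          for range time.Tick(10 * time.Second) {
              obj, err := client.Resource(gvr).Get(context.TODO(), "cluster", metav1.GetOptions{})
              if err != nil {
                  fmt.Println("get kubeapiserver/cluster:", err)
                  continue
              }
              nodes, _, _ := unstructured.NestedSlice(obj.Object, "status", "nodeStatuses")
              for _, n := range nodes {
                  node, ok := n.(map[string]interface{})
                  if !ok {
                      continue
                  }
                  cur, _, _ := unstructured.NestedInt64(node, "currentRevision")
                  tgt, _, _ := unstructured.NestedInt64(node, "targetRevision")
                  if tgt != 0 && cur != tgt {
                      fmt.Printf("%s %v rolling out revision %d -> %d\n",
                          time.Now().Format(time.RFC3339), node["nodeName"], cur, tgt)
                  }
              }
          }
      }

      The nodeStatuses stanza is updated node by node as the static pod rollout progresses, so any disruption interval overlapping a currentRevision -> targetRevision transition matches the pattern described above.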

      Examples:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1813938460914880512

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1813847818029240320

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1813213432246177792

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1811039682226556928

      Open Debug Tools > Intervals or expand the Spyglass charts to see the disruption and its correlation to operator updates.

      The error always appears to be a net/http timeout awaiting response headers, and it affects the kube, openshift, and oauth APIs, on both new and reused connections.
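
      That error string is what Go's net/http Transport returns when its ResponseHeaderTimeout expires before the server sends any response headers, i.e. the connection is accepted but never answered in time. A minimal probe in that style is sketched below; the endpoint URL, 5s header timeout, and 1s poll interval are illustrative assumptions, not the values the CI monitors use.

      package main

      import (
          "crypto/tls"
          "fmt"
          "net/http"
          "time"
      )

      func main() {
          client := &http.Client{
              Transport: &http.Transport{
                  // Produces "net/http: timeout awaiting response headers" when it fires.
                  ResponseHeaderTimeout: 5 * time.Second,
                  TLSClientConfig:       &tls.Config{InsecureSkipVerify: true}, // throwaway probe only
              },
          }

          var outageStart time.Time
          for range time.Tick(time.Second) {
              resp, err := client.Get("https://api.example-cluster:6443/healthz") // hypothetical endpoint
              if err != nil {
                  if outageStart.IsZero() {
                      outageStart = time.Now()
                  }
                  fmt.Println("probe failed:", err)
                  continue
              }
              resp.Body.Close()
              if !outageStart.IsZero() {
                  fmt.Printf("disruption ended, lasted %s\n", time.Since(outageStart).Round(time.Second))
                  outageStart = time.Time{}
              }
          }
      }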

      It is fairly easy to find more examples, but watch out for the other two patterns. (I've seen several examples of disruption during node updates, and another bar earlier in the job runs with an unknown cause; these may become separate bugs.)

      Attachments:
        1. image-2024-08-02-14-19-03-571.png (101 kB, Thomas Jungblut)
        2. image-2024-08-02-14-18-53-795.png (81 kB, Thomas Jungblut)
        3. image-2024-08-02-14-01-33-811.png (304 kB, Thomas Jungblut)
        4. image-2024-08-02-13-59-06-529.png (132 kB, Thomas Jungblut)
        5. image-2024-08-02-13-35-19-481.png (224 kB, Thomas Jungblut)
        6. image-2024-08-02-13-17-32-016.png (126 kB, Thomas Jungblut)
        7. image-2024-08-02-13-11-09-710.png (144 kB, Thomas Jungblut)
        8. image-2024-08-02-13-10-10-268.png (140 kB, Thomas Jungblut)
        9. image-2024-08-02-13-07-36-806.png (94 kB, Thomas Jungblut)

              Assignee: Dean West (dwest@redhat.com)
              Reporter: Devan Goodwin (rhn-engineering-dgoodwin)