OpenShift Bugs / OCPBUGS-51154: Increased openshift-api Disruption

      Beginning around 1/15, an increase in disruption has been detected across openshift-api backends.

      This frequently happens during conformance testing after an upgrade has completed, as seen in 4.19-e2e-aws-ovn-upgrade/1879411183967014912. That job upgrades to `release:4.19.0-0.nightly-2025-01-15-060507` and is, at the moment, the earliest detection of the pattern described here.

      Disruption is shown in the conformance intervals (spyglass_20250115-075108; see attached screenshots):

      At the same time, the intervals also contain etcd pod log entries indicating 'apply request took too long':
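
      The same signal can be checked against the raw etcd pod logs in the CI artifacts. A minimal sketch in Python, assuming the logs have been downloaded locally and are in etcd's JSON (zap) log format; the file path is a placeholder:

      # Hedged sketch: the log path is a placeholder for a locally downloaded etcd
      # pod log; assumes etcd's JSON (zap) format with "msg" and "took" fields.
      import json
      from pathlib import Path

      slow_applies = []
      for line in Path("etcd-member.log").read_text().splitlines():  # placeholder path
          try:
              entry = json.loads(line)
          except json.JSONDecodeError:
              continue  # skip any non-JSON lines
          if entry.get("msg") == "apply request took too long":
              slow_applies.append((entry.get("ts"), entry.get("took")))

      # Print the slow applies in order so they can be lined up with the
      # disruption window (~07:54 in this job).
      for ts, took in slow_applies:
          print(ts, took)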

      In the event-filter at that time (07:54 in this case), we see ProbeError and Unhealthy event reasons for openshift-kube-apiserver / openshift-apiserver.
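
      A minimal sketch of pulling the same events with the Kubernetes Python client, assuming kubeconfig access to a cluster reproducing this (in CI the equivalent data comes from the gathered events artifact):

      # Hedged sketch: assumes kubeconfig access to a live cluster; in CI the
      # same information is available from the gathered events artifact instead.
      from kubernetes import client, config

      config.load_kube_config()
      core = client.CoreV1Api()

      for ns in ("openshift-kube-apiserver", "openshift-apiserver"):
          for reason in ("ProbeError", "Unhealthy"):
              events = core.list_namespaced_event(ns, field_selector=f"reason={reason}")
              for ev in events.items:
                  print(ev.last_timestamp, ns, ev.reason,
                        ev.involved_object.name, ev.message)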

      Looking at PromeCIeus using

      histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{subresource!="log",verb!~"WATCH|WATCHLIST|PROXY"}[5m])) by(resource,le))
      

      we can see the request duration spike for a number of resources at that time:
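
      For reference, a minimal sketch of running that query programmatically, assuming the PromeCIeus instance exposes the standard Prometheus HTTP API; the endpoint URL is a placeholder:

      # Hedged sketch: PROM_URL is a placeholder, not the real PromeCIeus endpoint.
      import requests

      PROM_URL = "http://localhost:9090"  # placeholder Prometheus endpoint
      QUERY = (
          'histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket'
          '{subresource!="log",verb!~"WATCH|WATCHLIST|PROXY"}[5m])) by (resource, le))'
      )

      # Query a range around the disruption window (07:45-08:05 UTC on 2025-01-15).
      resp = requests.get(
          f"{PROM_URL}/api/v1/query_range",
          params={
              "query": QUERY,
              "start": "2025-01-15T07:45:00Z",
              "end": "2025-01-15T08:05:00Z",
              "step": "30s",
          },
          timeout=30,
      )
      resp.raise_for_status()

      # Print the worst p99 request duration seen per resource in the window.
      for series in resp.json()["data"]["result"]:
          resource = series["metric"].get("resource", "<none>")
          worst = max(float(v) for _, v in series["values"] if v != "NaN")
          print(f"{resource}: p99 max {worst:.2f}s")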

      Reviewing backend-disruption_20250115-075108.json and selecting the request-audit-id for one of the failures like:

      "Jan 15 07:54:04.483 - 999ms E backend-disruption-name/kube-api-reused-connections connection/reused disruption/openshift-tests reason/DisruptionBegan request-audit-id/6a2233a6-4f0f-435a-b11d-132185f218bf backend-disruption-name/kube-api-reused-connections connection/reused disruption/openshift-tests stopped responding to GET requests over reused connections: Get \"https://api.ci-op-9f1drtlh-36e26.aws-2.ci.openshift.org:6443/api/v1/namespaces/default\": net/http: timeout awaiting response headers",
      

      we can locate that entry in the audit logs and see that the latency appears to be in etcd:

      "auditID":"6a2233a6-4f0f-435a-b11d-132185f218bf"
      "apiserver.latency.k8s.io/etcd":"16.000126231s"
      "apiserver.latency.k8s.io/total":"16.006017506s"
      
      "Failure","message":"context canceled","code":500
      

      Looking to start this investigation with the apiserver team and potentially the rit team, though understanding the cause of the etcd latency is obviously key.

      Attachments:
        1. image-2025-02-25-20-09-11-679.png (28 kB)
        2. screenshot-1.png (15 kB)
        3. screenshot-2.png (30 kB)
        4. screenshot-3.png (179 kB)
        5. screenshot-4.png (120 kB)

      Assignee: Unassigned
      Reporter: Forrest Babcock (rh-ee-fbabcock)
      QA Contact: Ke Wang