Bug · Resolution: Unresolved · Major · 4.22 · Approved
TRT's disruption dashboard is showing an increase in apiserver-related disruptions.
Here is a dashboard link where you can see the trend.
The 90th percentile started climbing at the beginning of February.
Delving into this particular example, we can see from the interval chart that the disruption pattern occurs around the openshift-apiserver and openshift-oauth operator upgrades.
The disruption affects all three servers: kube-apiserver, openshift-apiserver, and the openshift-oauth server.
Here is a typical error from the disruption file:
Feb 17 05:14:46.342 - 999ms E backend-disruption-name/kube-api-reused-connections connection/reused disruption/openshift-tests reason/DisruptionBegan request-audit-id/64d4b179-13ca-41fe-9886-c07ad76f6e3e backend-disruption-name/kube-api-reused-connections connection/reused disruption/openshift-tests stopped responding to GET requests over reused connections: Get \"https://api.ci-op-zfkj05xt-5bc1c.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:6443/api/v1/namespaces/default\": net/http: timeout awaiting response headers
Tracing down that particular request in the audit log:
master0.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"64d4b179-13ca-41fe-9886-c07ad76f6e3e","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/default","verb":"get","user":{"username":"system:admin","groups":["system:masters","system:authenticated"],"extra":{"authentication.kubernetes.io/credential-id":["X509SHA256=da3aeddbee0d53ac72cb689993e77763f29efc982b6c07e189d8713da4c54dd9"]}},"sourceIPs":["18.212.59.107"],"userAgent":"openshift-external-backend-sampler-reused-kube-api","objectRef":{"resource":"namespaces","namespace":"default","name":"default","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","message":"context canceled","code":500},"requestReceivedTimestamp":"2026-02-17T05:14:46.340190Z","stageTimestamp":"2026-02-17T05:15:13.544368Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
The request appears to have timed out after roughly 27s (requestReceivedTimestamp 05:14:46.340 to stageTimestamp 05:15:13.544).
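The elapsed time can be confirmed directly from the two timestamps in the audit event above; a minimal Python check:

```python
from datetime import datetime

# Timestamps copied from the audit event above (RFC3339, trailing 'Z' means UTC).
received = datetime.fromisoformat("2026-02-17T05:14:46.340190+00:00")
completed = datetime.fromisoformat("2026-02-17T05:15:13.544368+00:00")

elapsed = (completed - received).total_seconds()
print(f"{elapsed:.3f}s")  # 27.204s
```

Note that 27.204s lines up with the recurring ~27204 ms maxima in the daily stats further down.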
From the same interval chart, we can see the corresponding etcd event complaining that the "response took too long". If you choose CloudMetrics, you can also see that the Azure disk queue depth exceeds the threshold (3).
I did a sampling of audit logs (two random jobs per day over the last month) and got the following statistics for the duration spread of the disruption sampling requests:
========================================================================================================================
CI JOB AUDIT LOG ANALYSIS - DAILY STATISTICS
External Backend Sampler Requests to /api/v1/namespaces/default
========================================================================================================================
Date Count Avg (ms) Max (ms) P50 (ms) P75 (ms) P95 (ms)
------------------------------------------------------------------------------------------------------------------------
2026-01-18 166,984 25.35 21289.07 1.37 2.97 7.59
2026-01-19 125,542 23.81 20509.70 1.09 2.08 8.51
2026-01-20 164,487 28.12 18813.35 1.26 3.08 8.37
2026-01-21 125,150 28.91 26632.49 1.21 2.22 6.02
2026-01-22 165,007 27.42 26904.09 1.25 2.16 6.45
2026-01-23 117,876 35.16 27204.11 1.15 2.29 7.06
2026-01-24 163,768 29.36 25192.02 1.32 3.28 8.24
2026-01-25 117,789 25.55 19749.69 1.24 2.17 5.73
2026-01-26 123,434 31.51 27204.24 1.14 1.95 6.35
2026-01-27 181,160 28.09 27213.22 1.44 3.08 7.47
2026-01-28 120,696 33.98 25690.80 1.27 2.46 7.16
2026-01-29 169,796 30.91 27203.09 1.25 2.68 7.08
2026-01-30 169,850 50.86 27206.23 1.15 2.01 5.88
2026-01-31 171,065 41.15 27203.45 1.22 2.70 6.80
2026-02-01 168,144 41.96 27203.67 1.28 2.31 6.00
2026-02-02 166,340 40.36 27204.00 1.23 2.86 7.74
2026-02-03 167,057 47.34 27204.83 1.15 1.96 7.55
2026-02-04 78,627 44.51 27204.10 1.35 2.56 6.56
2026-02-05 82,668 50.67 27204.72 1.82 3.99 10.01
2026-02-06 41,012 46.98 27203.31 1.16 1.92 6.97
2026-02-07 80,345 39.23 27202.67 1.25 2.16 6.74
2026-02-08 84,740 49.79 27203.42 1.72 3.80 8.75
2026-02-10 158,116 40.56 27203.97 1.29 3.08 7.57
2026-02-11 161,153 52.16 27205.30 1.12 2.19 4.99
2026-02-12 121,335 39.61 27204.22 1.34 2.51 6.53
2026-02-13 159,448 41.19 27203.65 1.30 3.01 8.35
2026-02-14 168,968 36.88 26081.77 1.25 2.82 7.54
2026-02-15 162,469 44.76 27203.33 1.29 3.16 7.77
2026-02-16 114,157 48.78 27204.22 1.34 2.40 6.67
2026-02-17 115,979 45.98 27203.84 1.26 2.21 6.78
------------------------------------------------------------------------------------------------------------------------
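For reference, a simplified sketch of how per-day stats like these can be computed from audit log lines. The user-agent substring, URI filter, and percentile method here are assumptions about the actual analysis, not a reproduction of it:

```python
import json
from datetime import datetime
from statistics import mean, quantiles

# Assumed filters: the sampler's user agent contains this substring, and all
# disruption probes hit the default namespace (as in the audit event above).
SAMPLER_UA = "external-backend-sampler"
TARGET_URI = "/api/v1/namespaces/default"

def _ts(s):
    # Audit timestamps are RFC3339 with a trailing 'Z'.
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

def request_durations_ms(lines):
    """Duration (ms) of each sampler request to TARGET_URI in an audit log."""
    out = []
    for line in lines:
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. log prefixes)
        if ev.get("requestURI") != TARGET_URI:
            continue
        if SAMPLER_UA not in ev.get("userAgent", ""):
            continue
        delta = _ts(ev["stageTimestamp"]) - _ts(ev["requestReceivedTimestamp"])
        out.append(delta.total_seconds() * 1000.0)
    return out

def summarize(durations):
    d = sorted(durations)
    p = quantiles(d, n=100, method="inclusive")  # p[49]=P50, p[74]=P75, p[94]=P95
    return {"count": len(d), "avg": mean(d), "max": d[-1],
            "p50": p[49], "p75": p[74], "p95": p[94]}
```

In the real analysis the lines would come from each job's (possibly gzipped) kube-apiserver audit logs, grouped by day before summarizing.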
Here are a few questions:
- What caused the behavior change?
- Is the disk usage following the same trend as the disruption dashboard?
- Is there any change in openshift-apiserver or openshift-oauth during the operator update that could explain the disk usage?
- We have a change in origin that sets the timeout value to 34s; we did this because that was supposed to be the timeout value kube-apiserver allows. Is there a change somewhere such that we are now only allowed 27s?