Bug · Resolution: Unresolved · Major · 4.22 · Approved
TRT's disruption dashboard is showing an increase in apiserver-related disruptions.
Here is a dashboard link where you can see the trend.
The 90th percentile started climbing at the beginning of February.
Delving into this particular example, we can see from the interval chart that the disruption pattern occurs around the openshift-apiserver and openshift-oauth operator upgrades.
The disruption affects all three servers: kube-apiserver, openshift-apiserver, and the openshift-oauth server.
Here is a typical error from the disruption file:
Feb 17 05:14:46.342 - 999ms E backend-disruption-name/kube-api-reused-connections connection/reused disruption/openshift-tests reason/DisruptionBegan request-audit-id/64d4b179-13ca-41fe-9886-c07ad76f6e3e backend-disruption-name/kube-api-reused-connections connection/reused disruption/openshift-tests stopped responding to GET requests over reused connections: Get \"https://api.ci-op-zfkj05xt-5bc1c.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:6443/api/v1/namespaces/default\": net/http: timeout awaiting response headers
Tracing down that particular request in the audit log:
master0.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"64d4b179-13ca-41fe-9886-c07ad76f6e3e","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/default","verb":"get","user":{"username":"system:admin","groups":["system:masters","system:authenticated"],"extra":{"authentication.kubernetes.io/credential-id":["X509SHA256=da3aeddbee0d53ac72cb689993e77763f29efc982b6c07e189d8713da4c54dd9"]}},"sourceIPs":["18.212.59.107"],"userAgent":"openshift-external-backend-sampler-reused-kube-api","objectRef":{"resource":"namespaces","namespace":"default","name":"default","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","message":"context canceled","code":500},"requestReceivedTimestamp":"2026-02-17T05:14:46.340190Z","stageTimestamp":"2026-02-17T05:15:13.544368Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}}
The request appears to have timed out after roughly 27s (requestReceivedTimestamp 05:14:46.340 to stageTimestamp 05:15:13.544).
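The elapsed time can be confirmed directly from the two timestamps in the audit event above; a minimal Python check:

```python
from datetime import datetime

# Timestamps copied from the audit event above (RFC3339, trailing 'Z' means UTC).
received = datetime.fromisoformat("2026-02-17T05:14:46.340190+00:00")
completed = datetime.fromisoformat("2026-02-17T05:15:13.544368+00:00")

elapsed = (completed - received).total_seconds()
print(f"{elapsed:.3f}s")  # 27.204s
```

Note that 27.204s lines up with the recurring ~27204 ms maxima in the daily stats further down.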
From the same interval chart, we can see the corresponding etcd event complaining that the "response took too long". If you choose CloudMetrics, you can also see that the Azure disk queue depth exceeds the threshold (3).
I did a sampling of audit logs (two random jobs per day over the last month) and got the following statistics for the duration spread of the disruption sampling requests:
========================================================================================================================
CI JOB AUDIT LOG ANALYSIS - DAILY STATISTICS
External Backend Sampler Requests to /api/v1/namespaces/default
========================================================================================================================
Date Count Avg (ms) Max (ms) P50 (ms) P75 (ms) P95 (ms)
------------------------------------------------------------------------------------------------------------------------
2026-01-18 166,984 25.35 21289.07 1.37 2.97 7.59
2026-01-19 125,542 23.81 20509.70 1.09 2.08 8.51
2026-01-20 164,487 28.12 18813.35 1.26 3.08 8.37
2026-01-21 125,150 28.91 26632.49 1.21 2.22 6.02
2026-01-22 165,007 27.42 26904.09 1.25 2.16 6.45
2026-01-23 117,876 35.16 27204.11 1.15 2.29 7.06
2026-01-24 163,768 29.36 25192.02 1.32 3.28 8.24
2026-01-25 117,789 25.55 19749.69 1.24 2.17 5.73
2026-01-26 123,434 31.51 27204.24 1.14 1.95 6.35
2026-01-27 181,160 28.09 27213.22 1.44 3.08 7.47
2026-01-28 120,696 33.98 25690.80 1.27 2.46 7.16
2026-01-29 169,796 30.91 27203.09 1.25 2.68 7.08
2026-01-30 169,850 50.86 27206.23 1.15 2.01 5.88
2026-01-31 171,065 41.15 27203.45 1.22 2.70 6.80
2026-02-01 168,144 41.96 27203.67 1.28 2.31 6.00
2026-02-02 166,340 40.36 27204.00 1.23 2.86 7.74
2026-02-03 167,057 47.34 27204.83 1.15 1.96 7.55
2026-02-04 78,627 44.51 27204.10 1.35 2.56 6.56
2026-02-05 82,668 50.67 27204.72 1.82 3.99 10.01
2026-02-06 41,012 46.98 27203.31 1.16 1.92 6.97
2026-02-07 80,345 39.23 27202.67 1.25 2.16 6.74
2026-02-08 84,740 49.79 27203.42 1.72 3.80 8.75
2026-02-10 158,116 40.56 27203.97 1.29 3.08 7.57
2026-02-11 161,153 52.16 27205.30 1.12 2.19 4.99
2026-02-12 121,335 39.61 27204.22 1.34 2.51 6.53
2026-02-13 159,448 41.19 27203.65 1.30 3.01 8.35
2026-02-14 168,968 36.88 26081.77 1.25 2.82 7.54
2026-02-15 162,469 44.76 27203.33 1.29 3.16 7.77
2026-02-16 114,157 48.78 27204.22 1.34 2.40 6.67
2026-02-17 115,979 45.98 27203.84 1.26 2.21 6.78
------------------------------------------------------------------------------------------------------------------------
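For reference, a simplified sketch of how per-day stats like these can be computed from audit log lines. The user-agent substring, URI filter, and percentile method here are assumptions about the actual analysis, not a reproduction of it:

```python
import json
from datetime import datetime
from statistics import mean, quantiles

# Assumed filters: the sampler's user agent contains this substring, and all
# disruption probes hit the default namespace (as in the audit event above).
SAMPLER_UA = "external-backend-sampler"
TARGET_URI = "/api/v1/namespaces/default"

def _ts(s):
    # Audit timestamps are RFC3339 with a trailing 'Z'.
    return datetime.fromisoformat(s.replace("Z", "+00:00"))

def request_durations_ms(lines):
    """Duration (ms) of each sampler request to TARGET_URI in an audit log."""
    out = []
    for line in lines:
        try:
            ev = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines (e.g. log prefixes)
        if ev.get("requestURI") != TARGET_URI:
            continue
        if SAMPLER_UA not in ev.get("userAgent", ""):
            continue
        delta = _ts(ev["stageTimestamp"]) - _ts(ev["requestReceivedTimestamp"])
        out.append(delta.total_seconds() * 1000.0)
    return out

def summarize(durations):
    d = sorted(durations)
    p = quantiles(d, n=100, method="inclusive")  # p[49]=P50, p[74]=P75, p[94]=P95
    return {"count": len(d), "avg": mean(d), "max": d[-1],
            "p50": p[49], "p75": p[74], "p95": p[94]}
```

In the real analysis the lines would come from each job's (possibly gzipped) kube-apiserver audit logs, grouped by day before summarizing.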
Here are a few questions:
- What caused the behavior change?
- Is the disk usage following the same trend as the disruption dashboard?
- Is there any change in openshift-apiserver or openshift-oauth during the operator update that could explain the disk usage?
- We have a change in origin that sets the timeout value to 34s; we did this because that was supposed to be the timeout value kube-apiserver allows. Is there a change somewhere such that we are now only allowed 27s?