OpenShift Bugs / OCPBUGS-77014

Disruptions with apiservers on azure increased over the last month


      TRT's disruption dashboard is showing an increase in apiserver-related disruptions.

      Here is a dashboard link where you can see the trend.

      You can see that the 90th percentile started climbing from the beginning of February.

      Delving into this particular example:

      From the interval chart, we can see the disruption pattern happening around the openshift-apiserver and openshift-oauth operator upgrades.

       

      Disruption is affecting all three servers: kube-apiserver, openshift-apiserver, and the openshift-oauth server.

       

      Here is a typical error from the disruption file:

      Feb 17 05:14:46.342 - 999ms E backend-disruption-name/kube-api-reused-connections connection/reused disruption/openshift-tests reason/DisruptionBegan request-audit-id/64d4b179-13ca-41fe-9886-c07ad76f6e3e backend-disruption-name/kube-api-reused-connections connection/reused disruption/openshift-tests stopped responding to GET requests over reused connections: Get \"https://api.ci-op-zfkj05xt-5bc1c.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX:6443/api/v1/namespaces/default\": net/http: timeout awaiting response headers 

       

      Tracing that particular request through the audit log:

      master0.log:{"kind":"Event","apiVersion":"audit.k8s.io/v1","level":"Metadata","auditID":"64d4b179-13ca-41fe-9886-c07ad76f6e3e","stage":"ResponseComplete","requestURI":"/api/v1/namespaces/default","verb":"get","user":{"username":"system:admin","groups":["system:masters","system:authenticated"],"extra":{"authentication.kubernetes.io/credential-id":["X509SHA256=da3aeddbee0d53ac72cb689993e77763f29efc982b6c07e189d8713da4c54dd9"]}},"sourceIPs":["18.212.59.107"],"userAgent":"openshift-external-backend-sampler-reused-kube-api","objectRef":{"resource":"namespaces","namespace":"default","name":"default","apiVersion":"v1"},"responseStatus":{"metadata":{},"status":"Failure","message":"context canceled","code":500},"requestReceivedTimestamp":"2026-02-17T05:14:46.340190Z","stageTimestamp":"2026-02-17T05:15:13.544368Z","annotations":{"authorization.k8s.io/decision":"allow","authorization.k8s.io/reason":""}} 
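The lookup above can be reproduced by scanning the gathered audit logs for the request-audit-id carried in the disruption event. A minimal sketch (file paths and the gathering layout are hypothetical; adjust to your must-gather):

```python
import json

# Hypothetical paths to the JSON-lines audit logs from each control-plane node.
AUDIT_LOGS = ["master0.log", "master1.log", "master2.log"]

# request-audit-id taken from the disruption event above.
TARGET_ID = "64d4b179-13ca-41fe-9886-c07ad76f6e3e"

def find_audit_event(audit_id, paths):
    """Return (path, event) for the first audit event whose auditID matches."""
    for path in paths:
        with open(path) as f:
            for line in f:
                try:
                    event = json.loads(line)
                except ValueError:
                    # Skip non-JSON lines (e.g. grep prefixes, truncated entries).
                    continue
                if event.get("auditID") == audit_id:
                    return path, event
    return None, None
```

Matching the sampler's `request-audit-id` against the apiserver's `auditID` is what ties the client-side disruption sample to the server-side record shown above.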

       

      The request timed out after ~27s (requestReceivedTimestamp 05:14:46.340 → stageTimestamp 05:15:13.544), with the apiserver reporting "context canceled".
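The ~27s figure follows directly from the two timestamps in the audit event; a quick check:

```python
from datetime import datetime

# Timestamps copied from the audit event above; "Z" normalized for fromisoformat.
received = datetime.fromisoformat("2026-02-17T05:14:46.340190Z".replace("Z", "+00:00"))
completed = datetime.fromisoformat("2026-02-17T05:15:13.544368Z".replace("Z", "+00:00"))

duration_ms = (completed - received).total_seconds() * 1000
print(f"{duration_ms:.0f} ms")  # 27204 ms
```

Note that this ~27204 ms matches the hard ceiling visible in the Max column of the table below, which suggests requests are being cut off by a server-side limit rather than timing out at arbitrary durations.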

       

      From the same interval chart, we can see the corresponding etcd event complaining that the "response took too long". If you choose CloudMetrics, you can also see that the Azure disk queue depth is over the threshold (3).

       

      I sampled audit logs (two random jobs per day over the last month) and got the following duration statistics for the disruption sampling requests:

       

      ========================================================================================================================
      CI JOB AUDIT LOG ANALYSIS - DAILY STATISTICS
      External Backend Sampler Requests to /api/v1/namespaces/default
      ========================================================================================================================
      
      
      Date         Count      Avg (ms)     Max (ms)     P50 (ms)     P75 (ms)     P95 (ms)    
      ------------------------------------------------------------------------------------------------------------------------
      2026-01-18   166,984    25.35        21289.07     1.37         2.97         7.59        
      2026-01-19   125,542    23.81        20509.70     1.09         2.08         8.51        
      2026-01-20   164,487    28.12        18813.35     1.26         3.08         8.37        
      2026-01-21   125,150    28.91        26632.49     1.21         2.22         6.02        
      2026-01-22   165,007    27.42        26904.09     1.25         2.16         6.45        
      2026-01-23   117,876    35.16        27204.11     1.15         2.29         7.06        
      2026-01-24   163,768    29.36        25192.02     1.32         3.28         8.24        
      2026-01-25   117,789    25.55        19749.69     1.24         2.17         5.73        
      2026-01-26   123,434    31.51        27204.24     1.14         1.95         6.35        
      2026-01-27   181,160    28.09        27213.22     1.44         3.08         7.47        
      2026-01-28   120,696    33.98        25690.80     1.27         2.46         7.16        
      2026-01-29   169,796    30.91        27203.09     1.25         2.68         7.08        
      2026-01-30   169,850    50.86        27206.23     1.15         2.01         5.88        
      2026-01-31   171,065    41.15        27203.45     1.22         2.70         6.80        
      2026-02-01   168,144    41.96        27203.67     1.28         2.31         6.00        
      2026-02-02   166,340    40.36        27204.00     1.23         2.86         7.74        
      2026-02-03   167,057    47.34        27204.83     1.15         1.96         7.55        
      2026-02-04   78,627     44.51        27204.10     1.35         2.56         6.56        
      2026-02-05   82,668     50.67        27204.72     1.82         3.99         10.01       
      2026-02-06   41,012     46.98        27203.31     1.16         1.92         6.97        
      2026-02-07   80,345     39.23        27202.67     1.25         2.16         6.74        
      2026-02-08   84,740     49.79        27203.42     1.72         3.80         8.75        
      2026-02-10   158,116    40.56        27203.97     1.29         3.08         7.57        
      2026-02-11   161,153    52.16        27205.30     1.12         2.19         4.99        
      2026-02-12   121,335    39.61        27204.22     1.34         2.51         6.53        
      2026-02-13   159,448    41.19        27203.65     1.30         3.01         8.35        
      2026-02-14   168,968    36.88        26081.77     1.25         2.82         7.54        
      2026-02-15   162,469    44.76        27203.33     1.29         3.16         7.77        
      2026-02-16   114,157    48.78        27204.22     1.34         2.40         6.67        
      2026-02-17   115,979    45.98        27203.84     1.26         2.21         6.78        
      ------------------------------------------------------------------------------------------------------------------------ 
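The statistics above can be reproduced with a short script over the JSON-lines audit logs. A sketch, assuming sampler requests are identified by their `userAgent` and `requestURI` as in the event shown earlier (function names are mine):

```python
import json
import statistics
from datetime import datetime

def parse_ts(ts):
    """Parse an audit-log RFC 3339 timestamp like 2026-02-17T05:14:46.340190Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))

def sampler_durations_ms(lines):
    """Extract request durations (ms) for external-backend-sampler audit events."""
    durations = []
    for line in lines:
        # Audit lines may carry a "<node>.log:" prefix when gathered via grep.
        start = line.find("{")
        if start < 0:
            continue
        try:
            event = json.loads(line[start:])
        except ValueError:
            continue
        if "external-backend-sampler" not in event.get("userAgent", ""):
            continue
        if event.get("requestURI") != "/api/v1/namespaces/default":
            continue
        delta = parse_ts(event["stageTimestamp"]) - parse_ts(event["requestReceivedTimestamp"])
        durations.append(delta.total_seconds() * 1000)
    return durations

def daily_stats(durations):
    """Count / Avg / Max / P50 / P75 / P95, mirroring the table columns."""
    qs = statistics.quantiles(durations, n=100, method="inclusive")
    return {
        "count": len(durations),
        "avg": statistics.fmean(durations),
        "max": max(durations),
        "p50": qs[49],
        "p75": qs[74],
        "p95": qs[94],
    }
```

Grouping the durations by the date in `requestReceivedTimestamp` and running `daily_stats` per bucket yields one table row per day.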

       

      Here are a few questions:

      • What caused the behavior change?
      • Is the disk usage following the same trend as the disruption dashboard?
      • Is there any change in openshift-apiserver or openshift-oauth during the operator update that can explain the disk usage?
      • We have changes in origin that set the timeout value to 34s, because that was supposed to be the timeout value kube-apiserver allows. Is there a change somewhere such that we are only allowed 27s now?

       

       

      Assignee: Unassigned
      Reporter: Ken Zhang (kenzhang@redhat.com)
      Ke Wang
      Votes: 0
      Watchers: 4