Details
- Bug
- Resolution: Obsolete
- Normal
- None
- 4.14
- Moderate
- No
- False
Description
There are two common patterns of Azure disruption for kube-apiserver; we need to be very careful to focus on just one in this bug. The scenario we are chasing is:
- a blip of disruption across kube-api on new connections (the same holds for openshift-api and oauth-api, but we focus on kube-api for simplicity)
- sometimes also affects re-used connections
- occurs long after the kube-apiserver clusteroperator reports all nodes have reached the new revision, roughly 20 minutes later
- occurs before node OS updates, if there are any (micro upgrade)
- usually accompanied by kube-apiserver or guard pods emitting pod readiness problems
Note that we focus on kube-api here, but the disruption usually hits openshift-api and oauth-api as well. Ingress is not affected.
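The timing criteria above can be expressed as a small check. This is a minimal sketch, not part of any actual tooling: the function name, the timestamps, and the 10-minute "well after the rollout" threshold are all hypothetical illustrations of the pattern described above.

```python
from datetime import datetime, timedelta

def matches_pattern(disruption_start, rollout_complete, os_update_start,
                    min_gap=timedelta(minutes=10)):
    """Return True if a disruption interval fits the pattern we are chasing:
    it begins well after the kube-apiserver clusteroperator finished rolling
    out the new revision (~20 minutes in the sample runs) and before any
    node OS update begins (os_update_start may be None for no OS updates)."""
    if disruption_start <= rollout_complete + min_gap:
        return False  # too close to the rollout itself; the other pattern
    if os_update_start is not None and disruption_start >= os_update_start:
        return False  # disruption during/after node OS updates is out of scope
    return True

# Hypothetical times shaped like sample job 1 (disruption at 3:53:34):
rollout = datetime(2023, 5, 4, 3, 33)
print(matches_pattern(datetime(2023, 5, 4, 3, 53, 34), rollout, None))  # True
print(matches_pattern(datetime(2023, 5, 4, 3, 35), rollout, None))      # False
```

The point is only to make the window concrete: disruption that lands inside the rollout, or after node OS updates start, belongs to a different bug.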
Intervals charts are key to the initial investigation of these. I'm going to link directly to the new sippy intervals pages (see the Intervals link under Debug Tools), which allow me to prepopulate the filters I'm using to isolate the problem. Removing these filters may help you see more; they are just a starting point.
Sample job runs:
Sample job 1 (intervals) (Loki logs): you can see the disruption at 3:53:34, accompanied by the pod readiness problems below.
Sample job 2 (intervals): Disruption at 12:54:45. No Loki logs.
Sample job 3 (intervals): Disruption at 1:02:41 PM. No Loki logs.
The Loki logs were recently re-enabled for this job (see Debug Tools on the prow job).
This pattern is quite common: disruption prior to node reboots, well after the clusteroperator reports it is no longer progressing, and pod readiness problems for kube-apiserver.
Again, removing my regex may help reveal additional details as to what's going on.
Solving this issue will help alleviate a large chunk of the disruption we see on Azure, which is an outlier in terms of the amount of disruption it experiences.
More details coming.
I think the problem appears on 4.14 Azure micro SDN upgrades as well:
So this is not likely an OVN problem.
It also seems to affect minor upgrades from 4.13:
For reasons I cannot explain, this issue almost disappeared for about four days, Apr 30 through May 3 inclusive. It is now back. The overall disruption graph shows the ebb and flow of this particular job, which has been quite erratic. That page also shows the most recent 500 runs of the job if you ever need additional samples.