OCPBUGS-13316: kube-api disruption on Azure after pods updated

    Description

      There are two common patterns of Azure disruption for kube-apiserver; we need to be careful to focus on just one of them in this bug. The scenario we are chasing is:

      1. a blip of disruption to kube-api new connections (the same applies to openshift-api and oauth-api, but we focus on kube for simplicity; a probe sketch below illustrates the distinction)
      2. sometimes also affects reused connections
      3. occurs long after the kube-apiserver clusteroperator reports that all nodes have reached the new revision, roughly 20 minutes later
      4. occurs before node OS updates, if there are any (micro upgrades may not include them)
      5. usually accompanied by kube-apiserver or guard pods emitting pod readiness problems

      Note that we focus on kube-api here, but the disruption usually hits openshift-api and oauth-api. Ingress is not affected.
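
      To make the new-versus-reused connection distinction above concrete, here is a minimal sketch (not the actual origin disruption monitor) of a poller that hits the kube-apiserver /readyz endpoint two ways: one client dials a fresh TCP/TLS connection for every request, the other reuses a keep-alive connection. The APISERVER_URL and BEARER_TOKEN environment variables are assumptions for illustration only.

      // disruption_probe.go: illustrative sketch only, not the real monitor.
      package main

      import (
          "crypto/tls"
          "fmt"
          "net/http"
          "os"
          "time"
      )

      // probe issues a single GET against /readyz and logs anything that
      // would show up as a disruption blip on the intervals chart.
      func probe(client *http.Client, kind, base, token string) {
          req, err := http.NewRequest("GET", base+"/readyz", nil)
          if err != nil {
              fmt.Printf("%s [%s] request error: %v\n", time.Now().Format(time.RFC3339), kind, err)
              return
          }
          if token != "" {
              req.Header.Set("Authorization", "Bearer "+token)
          }
          resp, err := client.Do(req)
          if err != nil {
              fmt.Printf("%s [%s] DISRUPTION: %v\n", time.Now().Format(time.RFC3339), kind, err)
              return
          }
          defer resp.Body.Close()
          if resp.StatusCode != http.StatusOK {
              fmt.Printf("%s [%s] DISRUPTION: status %d\n", time.Now().Format(time.RFC3339), kind, resp.StatusCode)
          }
      }

      func main() {
          base := os.Getenv("APISERVER_URL") // assumption: e.g. https://api.<cluster>:6443
          token := os.Getenv("BEARER_TOKEN") // assumption: a token with access to /readyz

          tlsCfg := &tls.Config{InsecureSkipVerify: true} // sketch only

          // "new" connections: keep-alives disabled, so every poll dials a fresh TCP/TLS connection.
          newConn := &http.Client{
              Timeout:   5 * time.Second,
              Transport: &http.Transport{TLSClientConfig: tlsCfg, DisableKeepAlives: true},
          }
          // "reused" connections: keep-alives enabled, so polls share an established connection.
          reusedConn := &http.Client{
              Timeout:   5 * time.Second,
              Transport: &http.Transport{TLSClientConfig: tlsCfg},
          }

          for range time.Tick(time.Second) {
              probe(newConn, "new", base, token)
              probe(reusedConn, "reused", base, token)
          }
      }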

      Intervals charts are key to the initial investigation of these. I'm going to link directly to the new Sippy intervals pages (see the Intervals link under Debug Tools), which allow me to prepopulate the filters I'm using to isolate the problem. Removing these filters may help you see more; they are just a starting point.

      Sample job runs:

      Sample job 1 (intervals) (Loki logs): you can see the disruption at 3:53:34, accompanied by pod readiness problems below.

      Sample job 2 (intervals): Disruption at 12:54:45. No Loki logs.

      Sample job 3 (intervals): Disruption at 1:02:41 PM. No Loki logs.

       

      The Loki logs were recently re-enabled for this job (see Debug Tools on the prow job).
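
      If you want to pull the readiness-probe failures out of those Loki logs programmatically rather than through the UI, here is a minimal sketch using Loki's query_range HTTP API. LOKI_URL, the label names in the LogQL selector, and the lack of authentication are assumptions for illustration; the CI Loki instance may use different labels and require a token.

      // loki_readiness_query.go: illustrative sketch of querying Loki directly.
      package main

      import (
          "encoding/json"
          "fmt"
          "net/http"
          "net/url"
          "os"
          "time"
      )

      // Minimal slice of Loki's query_range response that we care about.
      type lokiResponse struct {
          Data struct {
              Result []struct {
                  Stream map[string]string `json:"stream"` // labels for the stream
                  Values [][]string        `json:"values"` // [timestamp, log line] pairs
              } `json:"result"`
          } `json:"data"`
      }

      func main() {
          base := os.Getenv("LOKI_URL") // assumption: base URL of the Loki instance

          // LogQL: lines from the kube-apiserver namespace mentioning readiness
          // probe failures over the last hour. The label names are assumptions.
          params := url.Values{}
          params.Set("query", `{namespace="openshift-kube-apiserver"} |= "Readiness probe failed"`)
          params.Set("start", fmt.Sprint(time.Now().Add(-time.Hour).UnixNano()))
          params.Set("end", fmt.Sprint(time.Now().UnixNano()))
          params.Set("limit", "100")

          resp, err := http.Get(base + "/loki/api/v1/query_range?" + params.Encode())
          if err != nil {
              panic(err)
          }
          defer resp.Body.Close()

          var lr lokiResponse
          if err := json.NewDecoder(resp.Body).Decode(&lr); err != nil {
              panic(err)
          }
          for _, stream := range lr.Data.Result {
              for _, v := range stream.Values {
                  if len(v) == 2 {
                      fmt.Println(stream.Stream["pod"], v[1]) // the "pod" label is an assumption
                  }
              }
          }
      }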

      This pattern is quite common: disruption prior to node reboots, well after the clusteroperator reports it is no longer Progressing, and pod readiness problems for kube-apiserver.

      Again, removing my regex may help reveal additional details about what's going on.

      Solving this issue will alleviate a large chunk of the disruption we see on Azure, which is an outlier in terms of the amount of disruption it experiences.

      More details coming.

       

      I think the problem appears on 4.14 Azure SDN micro upgrades as well:

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-upgrade/1655237865774256128

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-upgrade/1655579993335402496

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-e2e-azure-sdn-upgrade/1655533239072198656

      So this is not likely an OVN problem.

      It also seems to affect minor upgrades from 4.13:

       

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-azure-sdn-upgrade/1655904572033470464

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.14-upgrade-from-stable-4.13-e2e-azure-sdn-upgrade/1655904571219775488

       

      For reasons I cannot explain, this issue almost disappeared for about four days, between Apr 30 and May 3 inclusive. It is now back. The overall disruption graph shows the ebb and flow for this particular job, which has been quite erratic. That page also shows the most recent 500 runs of the job if you ever need additional samples.

       

       

      People

        Unassigned
        Devan Goodwin (rhn-engineering-dgoodwin)
        Ke Wang