OCPBUGS-41778: kube-apiserver experiencing more disruption (4.18)

      TRT has detected a consistent long-term trend where the oauth-apiserver appears to have more disruption than it did in 4.16 for minor upgrades on Azure.

      The problem appears roughly above the 90th percentile; we picked it up at P95, where it shows a consistent 5-8s more disruption than we'd expect given the 4.16 GA data.

      The problem hits ONLY oauth, affecting both new and reused connections as well as the cached variants, meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s, several times throughout the run. It looks like it may be correlated with the oauth-operator reporting:

      source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]
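
      To check whether the endpoint drop is real during a run, it can help to watch the endpoints of the oauth-apiserver's "api" service directly. Below is a minimal sketch using the Python kubernetes client; the service and namespace names are taken from the condition message above, while kubeconfig access to the cluster under test, the polling interval, and the helper itself are assumptions for illustration.

      # Hypothetical helper (not part of this bug): poll the "api" endpoints in
      # openshift-oauth-apiserver and log whenever no address exposes a port
      # named "https", i.e. the same state as the APIServices_Error condition above.
      import time
      from kubernetes import client, config

      def https_address_count(ep):
          # Count addresses in subsets that expose a port named "https".
          count = 0
          for subset in ep.subsets or []:
              port_names = {p.name for p in (subset.ports or [])}
              if "https" in port_names:
                  count += len(subset.addresses or [])
          return count

      def main():
          config.load_kube_config()  # assumes a kubeconfig for the cluster under test
          core = client.CoreV1Api()
          while True:
              ep = core.read_namespaced_endpoints("api", "openshift-oauth-apiserver")
              if https_address_count(ep) == 0:
                  print(time.strftime("%H:%M:%S"), "no addresses with port name 'https'")
              time.sleep(1)

      if __name__ == "__main__":
          main()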
      

      Sample jobs:
      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
      Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
      Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
      Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState

      https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
      Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed

      More can be found using the first dashboard link in this post by scrolling down to the most recent job runs and looking for high numbers.

      The operator going degraded is probably the strongest symptom to pursue, as it appears in most of the runs above.
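
      A rough way to line those blips up with the disruption intervals is to watch the Available condition on the authentication ClusterOperator (which the condition quoted above appears to come from) during the upgrade and timestamp every transition. The sketch below uses the Python kubernetes client; the ClusterOperator name and the "Available" condition type come from the interval above, while the polling approach and interval are assumptions.

      # Rough triage aid (illustrative only): poll the "authentication" ClusterOperator
      # and print a line each time its Available condition changes, so transitions can
      # be matched against the disruption intervals.
      import time
      from kubernetes import client, config

      def available_condition(co):
          for cond in co.get("status", {}).get("conditions", []):
              if cond.get("type") == "Available":
                  return cond.get("status"), cond.get("reason"), cond.get("message")
          return None, None, None

      def main():
          config.load_kube_config()
          api = client.CustomObjectsApi()
          last = None
          while True:
              co = api.get_cluster_custom_object(
                  "config.openshift.io", "v1", "clusteroperators", "authentication")
              state = available_condition(co)
              if state != last:
                  status, reason, message = state
                  print(time.strftime("%H:%M:%S"), f"Available={status} reason={reason}: {message}")
                  last = state
              time.sleep(2)

      if __name__ == "__main__":
          main()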

      If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those, as they are unlikely to match the fingerprint of the error described here.
