-
Bug
-
Resolution: Done-Errata
-
Undefined
-
None
-
4.17.0
-
Important
-
Yes
-
Rejected
-
False
-
-
N/A
-
Release Note Not Required
-
Done
TRT has detected a consistent long term trend where the oauth-apiserver appears to have more disruption than it did in 4.16, for minor upgrades on azure.
The problem appears roughly over the 90th percentile, we picked it up at P95 where it shows a consistent 5-8s more than we'd expect given the data in 4.16 ga.
The problem hits ONLY oauth, affecting both new and reused connections, as well as the cached variants meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s throughout the run, several times over. It looks like it may be correlated to the oauth-operator reporting:
source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]
Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed
More can be found using the first link to the dashboard in this post and scrolling down to most recent job runs, and looking for high numbers.
The operator degraded is probably the strongest symptom to persue as it appears in most of the above.
If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those as they are unlikely to be the same fingerprint as the error being described here.
- depends on
-
OCPBUGS-41586 oauth-apiserver experiencing more disruption (4.18)
- Verified
- is cloned by
-
OCPBUGS-41586 oauth-apiserver experiencing more disruption (4.18)
- Verified
- links to
-
RHSA-2024:3718 OpenShift Container Platform 4.17.0 bug fix and security update