- Bug
- Resolution: Unresolved
- Undefined
- None
- 4.18.0
- Important
- Yes
- Approved
- False
TRT has detected a consistent long-term trend where the oauth-apiserver appears to have more disruption than it did in 4.16 for minor upgrades on Azure.
The problem appears roughly above the 90th percentile; we picked it up at P95, where disruption runs a consistent 5-8s higher than we'd expect given the 4.16 GA data.
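For context on those numbers, a minimal sketch of the comparison behind the claim: take per-run oauth-apiserver disruption seconds, compute the P95 for a 4.16 baseline and for current payloads, and look at the delta. The sample values below are made up purely for illustration, not real CI data.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// p95 returns the 95th-percentile value of the samples (nearest-rank method).
func p95(samples []float64) float64 {
	s := append([]float64(nil), samples...)
	sort.Float64s(s)
	idx := int(math.Ceil(float64(len(s))*0.95)) - 1
	if idx < 0 {
		idx = 0
	}
	return s[idx]
}

func main() {
	// Illustrative per-run disruption seconds only, not real CI data.
	baseline416 := []float64{0, 0, 1, 1, 2, 2, 3, 4, 5, 6}
	current := []float64{0, 1, 2, 2, 3, 5, 6, 8, 10, 12}

	fmt.Printf("4.16 baseline P95: %.0fs, current P95: %.0fs, delta: %.0fs\n",
		p95(baseline416), p95(current), p95(current)-p95(baseline416))
}
```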
The problem hits ONLY oauth, affecting both new and reused connections as well as the cached variants, meaning etcd should be out of the picture. You'll see a few very short blips where all four of these backends lose connectivity for ~1s, several times throughout the run. It looks like it may be correlated with the oauth operator reporting:
source/OperatorAvailable display/true condition/Available reason/APIServices_Error status/False APIServicesAvailable: apiservices.apiregistration.k8s.io/v1.user.openshift.io: not available: endpoints for service/api in "openshift-oauth-apiserver" have no addresses with port name "https" [2s]
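To make that condition concrete, here is a rough client-go sketch (not part of any existing tooling; assumes KUBECONFIG points at an affected cluster) that checks whether the endpoints for service/api in openshift-oauth-apiserver currently have any ready addresses behind the "https" port, which is exactly what the APIServices_Error message is complaining about.

```go
package main

import (
	"context"
	"fmt"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from $KUBECONFIG (falls back to in-cluster config if unset).
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// The operator condition complains that endpoints for service/api in
	// openshift-oauth-apiserver have no addresses with port name "https".
	ep, err := client.CoreV1().Endpoints("openshift-oauth-apiserver").Get(
		context.TODO(), "api", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	ready := 0
	for _, subset := range ep.Subsets {
		for _, port := range subset.Ports {
			if port.Name == "https" {
				ready += len(subset.Addresses)
			}
		}
	}
	fmt.Printf("service/api ready addresses behind port \"https\": %d\n", ready)
}
```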
Sample jobs:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669509775822848
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669509775822848/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122854.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827669493837467648
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827669493837467648/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240825-122623.json&overrideDisplayFlag=1&selectedSources=OperatorAvailable&selectedSources=OperatorProgressing&selectedSources=OperatorDegraded&selectedSources=KubeletLog&selectedSources=EtcdLog&selectedSources=EtcdLeadership&selectedSources=Alert&selectedSources=Disruption&selectedSources=E2EFailed&selectedSources=APIServerGracefulShutdown&selectedSources=KubeEvent&selectedSources=NodeState
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/1827077182283845632
Intervals: https://sippy.dptools.openshift.org/sippy-ng/job_runs/1827077182283845632/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-azure-ovn-upgrade/intervals?filterText=&intervalFile=e2e-timelines_spyglass_20240823-212127.json&overrideDisplayFlag=1&selectedSources=OperatorDegraded&selectedSources=EtcdLog&selectedSources=Disruption&selectedSources=E2EFailed
More examples can be found by following the first dashboard link in this post, scrolling down to the most recent job runs, and looking for high numbers.
The operator going degraded is probably the strongest symptom to pursue, as it appears in most of the runs above.
If you find any runs where other backends are disrupted, especially kube-api, I would suggest ignoring those, as they are unlikely to share the fingerprint of the error described here.
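If it helps with triage, here is a rough sketch of one way to bucket the disruption intervals from the e2e-timelines files linked above into oauth-apiserver vs. everything else. This is not existing tooling, and the JSON field names and locator matching below are assumptions about the intervals schema that may need adjusting against a real file.

```go
// Rough triage sketch: bucket disruption intervals from an
// e2e-timelines_spyglass_*.json file into oauth-apiserver vs. other backends.
// Field names and locator matching are assumptions about the schema.
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

type interval struct {
	Locator string `json:"locator"`
	Message string `json:"message"`
	From    string `json:"from"`
	To      string `json:"to"`
}

type timeline struct {
	Items []interval `json:"items"`
}

func main() {
	raw, err := os.ReadFile(os.Args[1]) // path to a downloaded intervals file
	if err != nil {
		panic(err)
	}
	var tl timeline
	if err := json.Unmarshal(raw, &tl); err != nil {
		panic(err)
	}

	// Crude heuristic: assume disruption intervals carry a "disruption/" locator.
	// Runs where non-oauth backends (especially kube-api) are also disrupted are
	// likely a different fingerprint than the one described in this bug.
	var oauth, other int
	for _, it := range tl.Items {
		if !strings.Contains(it.Locator, "disruption/") {
			continue
		}
		if strings.Contains(it.Locator, "oauth-api") {
			oauth++
		} else {
			other++
		}
	}
	fmt.Printf("disruption intervals: oauth-apiserver=%d, other backends=%d\n", oauth, other)
}
```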
- blocks: OCPBUGS-41867 oauth-apiserver experiencing more disruption (4.17) [Closed]
- clones: OCPBUGS-39021 oauth-apiserver experiencing more disruption in 4.17 [Closed]
- is cloned by:
  - OCPBUGS-41617 openshift-apiserver experiencing more disruption (4.18) [Verified]
  - OCPBUGS-41778 kube-apiserver experiencing more disruption (4.18) [Verified]
  - OCPBUGS-41867 oauth-apiserver experiencing more disruption (4.17) [Closed]
- is depended on by: OCPBUGS-39021 oauth-apiserver experiencing more disruption in 4.17 [Closed]
- links to: RHEA-2024:6122 OpenShift Container Platform 4.18.z bug fix update