- Bug
- Resolution: Obsolete
- Major
- None
- 4.17
- None
- MON Sprint 258, MON Sprint 259
- 2
- Rejected
- False
Description of problem:
TRT received an alert for a disruption regression in the metrics-api backend: the 95th percentile is about 5s worse than the 4.16 GA data. The trend can be viewed on this dashboard:
https://grafana-loki.ci.openshift.org/d/ISnBj4LVk/disruption?orgId=1&var-percentile=P95&var-platform=aws&var-backend=metrics-api-new-connections&var-upgrade_type=minor&var-master_nodes_updated=Y&var-architectures=amd64&var-topologies=ha&var-networks=ovn&var-releases=4.17
On the dashboard you can also scroll down to the "Last 500 Job" panel to see the list of jobs with their disruption counts, and click any of them to open the prow job for further investigation. Example job:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.17-upgrade-from-stable-4.16-e2e-aws-ovn-upgrade/1810404925734129664
The disruption appears to occur whenever one of the masters is being rebooted.
How reproducible:
Steps to Reproduce:
1.
2.
3.
Actual results:
Expected results:
Additional info:
- is triggering: OCPBUGS-39133 "Kube-aggregator reaching stale apiservice endpoints" (Verified)