Bug
Resolution: Unresolved
4.19.z, 4.20
Quality / Stability / Reliability
During a loaded upgrade test on an AWS cluster with IPsec enabled, two primary performance issues were observed:
1. Etcd fsync latency: The 10-minute average of the 99th-percentile etcd fsync latency on the master node etcd-ip-10-0-86-131.us-west-2.compute.internal exceeded the 10ms threshold. The alert reported 0.01s, so the value sits at the threshold as displayed.
2. API call latency: The 10-minute average of the 99th-percentile mutating API call latency for POST/rolebindings was 3.97s, nearly four times the 1-second alert threshold.
These issues suggest a performance regression in the etcd component during high-load operations, specifically affecting API server responsiveness. The logs show numerous timeouts (Timeout: request did not complete within requested timeout - context deadline exceeded) when creating Service objects, which is consistent with the observed high API latency.
We are executing:
ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.20-nightly-x86-loaded-upgrade-from-4.19.yaml
Jira ticket: https://issues.redhat.com/projects/OCPQE/issues/OCPQE-30376
PR: https://github.com/openshift/release/pull/68948
Version-Release number of selected component (if applicable):
OpenShift Version: 4.19.z -> 4.20 upgrade
Cluster Configuration: 3 masters, 24 workers, 3 infra nodes on AWS with IPsec enabled
Test: loaded-upgrade-419to420-24nodes-udn-72iterations
How reproducible:
The issue has been observed to be reproducible across multiple test runs and in different AWS regions, indicating a consistent problem rather than an isolated incident.
Steps to Reproduce:
1. Install an OpenShift cluster on AWS with 3 master and 24 worker nodes, enabling IPsec.
2. Upgrade the cluster from OpenShift version 4.19.z to 4.20.z.
3. Execute the udn-density-pods benchmark test with kube-burner, specifically the loaded-upgrade-419to420-24nodes-udn-72iterations job.
4. Monitor the etcd and API server latency metrics during the test.
To trigger the job: increase the timeout of the test in ci-operator/config/openshift-eng/ocp-qe-perfscale-ci/openshift-eng-ocp-qe-perfscale-ci-main__aws-4.20-nightly-x86-loaded-upgrade-from-4.19.yaml as shown in the PR linked above, run "make update" and "make jobs" in your terminal, and push the changes. Then trigger the Prow job by commenting "/pj-rehearse pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-ipsec-4.20-nightly-x86-loaded-upgrade-from-4.19-loaded-upgrade-419to420-24nodes-udn-72iterations" on the PR.
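For step 4, the following is a minimal sketch of how the two latency series could be queried by hand from the cluster's Prometheus. It is an assumption-laden illustration, not the alert profile the job actually uses: the metric names are the standard etcd and kube-apiserver histograms, while the label sets ("pod", "verb", "resource"), the 2m rate window, and the helper names are hypothetical choices made here.

```python
from urllib.parse import urlencode


def p99_avg_query(bucket_metric: str, by: str,
                  window: str = "10m", rate_window: str = "2m") -> str:
    """Build PromQL for the `window` average of the p99 of a histogram."""
    p99 = (
        f"histogram_quantile(0.99, "
        f"sum(rate({bucket_metric}[{rate_window}])) by ({by}))"
    )
    # The trailing "[10m:]" is a PromQL subquery over the p99 series.
    return f"avg_over_time({p99}[{window}:])"


# Assumed metric and label names; verify them against the cluster before use.
ETCD_FSYNC_P99 = p99_avg_query(
    "etcd_disk_wal_fsync_duration_seconds_bucket", "le, pod"
)
API_MUTATING_P99 = p99_avg_query(
    'apiserver_request_duration_seconds_bucket{verb=~"POST|PUT|PATCH|DELETE"}',
    "le, verb, resource",
)


def instant_query_url(base_url: str, promql: str) -> str:
    """URL for the Prometheus HTTP API instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"
```

Comparing the results against 0.01 (seconds) for fsync and 1 (second) for mutating calls mirrors the thresholds named in the alerts.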
Actual results:
The cluster performance degraded significantly during the test, leading to API call timeouts and warnings for high etcd fsync latency.
etcd fsync latency alert: 10 minutes avg. 99th etcd fsync latency on etcd-ip-10-0-86-131.us-west-2.compute.internal higher than 10ms. 0.01s
mutating API call latency alert: 10 minutes avg. 99th mutating API call latency for POST/rolebindings higher than 1 second. 3.97s
Multiple "Error creating object Service" messages caused by request timeouts.
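To put a number on "multiple", the timeouts can be counted directly in the downloaded build log. This is a rough sketch: the marker string is taken from the timeout message quoted in this report, while the function name and the local file name are hypothetical.

```python
from typing import Iterable

# Substring of the timeout message quoted in this report.
TIMEOUT_MARKER = "request did not complete within requested timeout"


def count_timeouts(lines: Iterable[str]) -> int:
    """Count log lines that contain the API request timeout message."""
    return sum(1 for line in lines if TIMEOUT_MARKER in line)
```

For example, with the build log saved locally as build-log.txt, `count_timeouts(open("build-log.txt", encoding="utf-8"))` returns the number of matching lines.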
Expected results:
The OpenShift cluster and its core components, including etcd and the API server, should maintain stable and acceptable performance metrics during a loaded upgrade and density tests, with no significant regression in latency compared to previous versions. The etcd fsync latency and API call latency should not exceed their configured warning thresholds.
Additional info:
This bug report is based on findings from a test run identified by the UUID ecaad422-796a-4588-a3ab-73cbd8a20e81. The test was executed as part of the rehearse-68948-pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-ipsec-4.20-nightly-x86-loaded-upgrade-from-4.19-loaded-upgrade-419to420-24nodes-udn-72iterations Prow job. The test logs and a related Slack discussion confirm that the high latency and timeouts are consistently observed, indicating a potential performance regression in OpenShift versions 4.19 and 4.20 on AWS IPsec clusters.
time="2025-09-05 18:52:07" level=warning msg="🚨 alert at 2025-09-05T18:48:03Z: '10 minutes avg. 99th etcd fsync latency on etcd-ip-10-0-86-131.us-west-2.compute.internal higher than 10ms. 0.01s'" file="alert_manager.go:206"
time="2025-09-05 18:52:08" level=warning msg="🚨 alert at 2025-09-05T18:45:03Z: '10 minutes avg. 99th mutating API call latency for POST/rolebindings higher than 1 second. 3.97s'" file="alert_manager.go:206"
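When comparing runs, it can help to pull the alert values out of the log mechanically. The sketch below parses lines shaped like the two warnings above; the regular expression is inferred from those two lines only and may not cover other alert descriptions, and the `Alert` type and `parse_alert` name are illustrative.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Pattern inferred from the two alert lines quoted in this report.
ALERT_RE = re.compile(
    r"alert at (?P<ts>\S+): "
    r"'(?P<desc>.*?) higher than (?P<limit>.+?)\. (?P<value>[\d.]+)s'"
)


@dataclass
class Alert:
    timestamp: str        # time the alert fired, as printed in the log
    description: str      # text before "higher than"
    threshold: str        # threshold as printed, e.g. "10ms" or "1 second"
    value_seconds: float  # reported value in seconds


def parse_alert(line: str) -> Optional[Alert]:
    """Return the parsed alert, or None if the line is not an alert line."""
    m = ALERT_RE.search(line)
    if m is None:
        return None
    return Alert(m["ts"], m["desc"], m["limit"], float(m["value"]))
```

Lines that do not match return `None`, so the function can be mapped over a whole build log and the results filtered.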
Log Link: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/test-platform-results/pr-logs/pull/openshift_release/68948/rehearse-68948-pull-ci-openshift-eng-ocp-qe-perfscale-ci-main-aws-ipsec-4.20-nightly-x86-loaded-upgrade-from-4.19-loaded-upgrade-419to420-24nodes-udn-72iterations/1964013141792657408/artifacts/loaded-upgrade-419to420-24nodes-udn-72iterations/openshift-qe-udn-density-pods/build-log.txt
Slack link: https://redhat-internal.slack.com/archives/C03ABT5822W/p1757341436523499