Bug
Resolution: Unresolved
Normal
4.18
Quality / Stability / Reliability
Moderate
Description of problem:
In the GCP-OVN-FIPS-EtcdEncrypt cluster's node-density-heavy test, the 4.18 P99 Ready latency (162.00s) is longer than the 4.17 latency (132.00s).
We are executing the job configs openshift-qe/ocp-qe-perfscale-ci/openshift-qe-ocp-qe-perfscale-ci-main__gcp-4.17-nightly-x86.yaml#L45 and openshift-qe/ocp-qe-perfscale-ci/openshift-qe-ocp-qe-perfscale-ci-main__gcp-4.18-nightly-x86.yaml#L45.
GitHub PR in which the test was executed: https://github.com/openshift/release/pull/59993 (you will find the prow job links in the comments)
Version-Release number of selected component (if applicable):
4.17.0-0.nightly-2025-03-03-235758 & 4.18.0-0.nightly-2025-03-03-231444
How reproducible:
Observed twice; more tests will be run to determine how reproducible it is (results added in the Additional info section below).
Steps to Reproduce:
- Install a GCP cluster with 3 masters, 24 workers, and 3 infra nodes (m5.xlarge: 4 vCPU / 16 GiB memory)
- Run the node-density-heavy-etcd-fips-24nodes test (4.17 test link & 4.18 test link)
- Check the P99 Ready latency metric after the test and compare 4.18 with 4.17
To re-run the test as a prow rehearsal job:
1. Increase the timeout of this test in ocp-qe-perfscale-ci/openshift-qe-ocp-qe-perfscale-ci-main__gcp-4.18-nightly-x86.yaml#L45 as shown in the PR linked above.
2. Run "make update" and "make jobs" in your terminal and push the changes.
3. Trigger the prow job by commenting "/pj-rehearse pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-gcp-4.18-nightly-x86-node-density-heavy-etcd-fips-24nodes" on the PR.
Actual results:
4.18 P99 Ready latency (162.00s) is longer than 4.17 (132.00s)
Expected results:
There is no regression in 4.18 P99 Ready latency
Additional info:
For GCP-OVN-FIPS-EtcdEncrypt job runs
1 job for 4.17
Mar 04 job
time="2025-03-04 16:14:23" level=error msg="podLatency: P99 Ready latency (132.00s) higher than configured threshold: 2m0s" file="job.go:164" time="2025-03-04 16:22:55" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-dndsyry9-83979.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:121" time="2025-03-04 16:22:55" level=warning msg="🚨 alert at 2025-03-04T16:05:18Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-dndsyry9-83979-tt82n-master-0 higher than 10ms. 0.011s'" file="alert_manager.go:201" time="2025-03-04 16:22:55" level=warning msg="🚨 alert at 2025-03-04T16:09:18Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-dndsyry9-83979-tt82n-master-1 higher than 10ms. 0.01s'" file="alert_manager.go:201"
4 jobs for 4.18
Mar 04 job
time="2025-03-04 16:15:55" level=error msg="podLatency: P99 Ready latency (162.00s) higher than configured threshold: 2m0s" file="job.go:164" time="2025-03-04 16:24:20" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-8s43clld-9b9f0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:121" time="2025-03-04 16:24:20" level=warning msg="🚨 alert at 2025-03-04T16:06:16Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-8s43clld-9b9f0-4n4pn-master-1 higher than 10ms. 0.012s'" file="alert_manager.go:201" time="2025-03-04 16:24:20" level=warning msg="🚨 alert at 2025-03-04T16:07:46Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-8s43clld-9b9f0-4n4pn-master-0 higher than 10ms. 0.01s'" file="alert_manager.go:201" time="2025-03-04 16:24:20" level=warning msg="🚨 alert at 2025-03-04T16:13:16Z: '10 minutes avg. 99th etcd commit latency on etcd-ci-op-8s43clld-9b9f0-4n4pn-master-0 higher than 30ms. 0.03s'" file="alert_manager.go:201" time="2025-03-04 16:24:20" level=warning msg="🚨 alert at 2025-03-04T16:06:16Z: '10 minutes avg. 99th mutating API call latency for DELETE/services higher than 1 second. 1.645s'" file="alert_manager.go:201"
Dec 20 job
time="2024-12-20 13:57:27" level=error msg="podLatency: P99 Ready latency (171.00s) higher than configured threshold: 2m0s" file="job.go:212" time="2024-12-20 14:09:17" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-1j9zm8qk-9b9f0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:135" time="2024-12-20 14:09:17" level=warning msg="🚨 alert at 2024-12-20T13:47:33Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-1j9zm8qk-9b9f0-khjs4-master-0 higher than 10ms. 0.012s'" file="alert_manager.go:217" time="2024-12-20 14:09:17" level=warning msg="🚨 alert at 2024-12-20T13:51:33Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-1j9zm8qk-9b9f0-khjs4-master-2 higher than 10ms. 0.01s'" file="alert_manager.go:217" time="2024-12-20 14:09:17" level=warning msg="🚨 alert at 2024-12-20T13:57:03Z: '10 minutes avg. 99th etcd commit latency on etcd-ci-op-1j9zm8qk-9b9f0-khjs4-master-0 higher than 30ms. 0.03s'" file="alert_manager.go:217" time="2024-12-20 14:09:17" level=warning msg="🚨 alert at 2024-12-20T13:56:33Z: '10 minutes avg. 99th mutating API call latency for POST/serviceaccounts higher than 1 second. 1.076s'" file="alert_manager.go:217"
Dec 19 job
time="2024-12-18 19:13:23" level=error msg="podLatency: P99 Ready latency (191.00s) higher than configured threshold: 2m0s" file="job.go:212" time="2024-12-18 19:24:07" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-txyqtv0b-9b9f0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:135" time="2024-12-18 19:24:08" level=warning msg="🚨 alert at 2024-12-18T19:03:41Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-txyqtv0b-9b9f0-9fvzx-master-0 higher than 10ms. 0.012s'" file="alert_manager.go:217"
Dec 18 job
time="2024-12-18 13:15:07" level=error msg="podLatency: P99 Ready latency (122.00s) higher than configured threshold: 2m0s" file="job.go:212" time="2024-12-18 13:23:46" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci- op-1h6bgfsi-9b9f0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:135" time="2024-12-18 13:23:46" level=warning msg="🚨 alert at 2024-12-18T13:10:38Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-1h6bgfsi-9b9f0-6b66r-master-1 higher than 10ms. 0.01s'" file="alert_manager.go:217" time="2024-12-18 13:23:46" level=warning msg="🚨 alert at 2024-12-18T13:08:38Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-1h6bgfsi-9b9f0-6b66r-master-0 higher than 10ms. 0.01s'" file="alert_manager.go:217"
Looking at the two Mar 04 runs of node-density-heavy on GCP without FIPS and etcd encryption, 4.18 (41s) is better than 4.17 (111s).
4.17 job
P99 Ready latency: 111s
time="2025-03-04 11:27:18" level=info msg="node-density-heavy: Ready 99th: 111000 max: 122000 avg: 29041" file="pod_latency.go:252" time="2025-03-04 11:36:03" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-2spt8sfc-fbfc9.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:121" time="2025-03-04 11:36:03" level=warning msg="🚨 alert at 2025-03-04T11:20:34Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-2spt8sfc-fbfc9-vm5nk-master-2 higher than 10ms. 0.01s'" file="alert_manager.go:201" time="2025-03-04 11:36:04" level=warning msg="🚨 alert at 2025-03-04T11:25:04Z: 'ci-op-2spt8sfc-fbfc9-vm5nk-worker-a-5lrr9/kubelet down'" file="alert_manager.go:201"
4.18 job
P99 Ready latency: 41s
time="2025-03-04 11:24:03" level=info msg="node-density-heavy: Ready 99th: 41000 max: 42000 avg: 18696" file="pod_latency.go:252" time="2025-03-04 11:42:33" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-09s5gqrf-834d1.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:121" time="2025-03-04 11:42:33" level=warning msg="🚨 alert at 2025-03-04T11:30:45Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-09s5gqrf-834d1-4m67d-master-0 higher than 10ms. 0.01s'" file="alert_manager.go:201" time="2025-03-04 11:42:33" level=warning msg="🚨 alert at 2025-03-04T11:26:45Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-09s5gqrf-834d1-4m67d-master-1 higher than 10ms. 0.016s'" file="alert_manager.go:201" time="2025-03-04 11:42:33" level=warning msg="🚨 alert at 2025-03-04T11:26:45Z: '10 minutes avg. 99th mutating API call latency for DELETE/services higher than 1 second. 1.094s'" file="alert_manager.go:201"
After some analysis:
- The issue does not reproduce on AWS.
- The issue does not reproduce on GCP without FIPS-EtcdEncrypt, but the results there are not stable: 4.18 41s vs. 4.17 111s.
- On GCP-OVN-FIPS-EtcdEncrypt, the node-density-heavy result is worse than without FIPS-EtcdEncrypt.
- On GCP-OVN-FIPS-EtcdEncrypt, node-density-heavy 4.18 is worse than 4.17 based on the two runs done so far; more samples are needed to confirm (a rough comparison of the samples quoted in this report is sketched below).
- On GCP-OVN-FIPS-EtcdEncrypt, node-density shows no regression, but its results are not stable (most runs land at 3-4s, with two outliers at 8s and 9s), so this is not likely to be a scheduler issue.
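To make the "more samples" point concrete, here is a minimal sketch (plain Python, not part of the CI tooling) that summarizes only the P99 Ready latency samples already quoted in this report:

from statistics import mean, stdev

# Rough comparison of the samples quoted in this report; no new data is introduced here.
samples_417 = [132.0]                        # Mar 04 GCP-OVN-FIPS-EtcdEncrypt run, 4.17
samples_418 = [162.0, 171.0, 191.0, 122.0]   # Mar 04, Dec 20, Dec 19, Dec 18 runs, 4.18

def summarize(label: str, samples: list[float]) -> None:
    spread = stdev(samples) if len(samples) > 1 else 0.0
    print(f"{label}: n={len(samples)} mean={mean(samples):.1f}s stdev={spread:.1f}s")

summarize("4.17", samples_417)   # 4.17: n=1 mean=132.0s stdev=0.0s
summarize("4.18", samples_418)   # 4.18: n=4 mean=161.5s stdev=29.0s
# With a single 4.17 sample and a ~29s spread across the 4.18 runs, there is not yet
# enough data to call this a confirmed regression; more runs of both versions are needed.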
Affected Platforms:
This is an internal Red Hat testing failure. It was tested on Prow in this PR: https://github.com/openshift/release/pull/59993 (job links in the comments).