OCPBUGS-52338

[GCP-OVN-FIPS-EtcdEncrypt] 4.18 P99 Ready latency is longer than 4.17 in node-density-heavy-etcd-fips-24nodes test

      Description of problem:

      In the GCP-OVN-FIPS-EtcdEncrypt cluster's node-density-heavy test, the 4.18 P99 Ready latency (162.00s) is longer than 4.17's (132.00s).

      We are executing openshift-qe/ocp-qe-perfscale-ci/openshift-qe-ocp-qe-perfscale-ci-main__gcp-4.17-nightly-x86.yaml#L45 & openshift-qe/ocp-qe-perfscale-ci/openshift-qe-ocp-qe-perfscale-ci-main__gcp-4.18-nightly-x86.yaml#L45

      Reference: https://issues.redhat.com/browse/OCPQE-26730?focusedId=26709679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-26709679 

      GitHub PR in which the test was executed: https://github.com/openshift/release/pull/59993 (you will find the prow job links in the comments)
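
      For context, the P99 Ready latency reported by kube-burner's podLatency measurement is the 99th percentile of the time from pod creation until the pod reports the Ready condition. A minimal sketch of how such a percentile is computed (nearest-rank method; the sample values below are made up purely for illustration and are not from these runs):

      # Minimal sketch (not kube-burner's actual implementation): compute the
      # P99 Ready latency from per-pod Ready latencies in milliseconds.
      import math

      def p99(latencies_ms):
          """Return the 99th percentile using the nearest-rank method."""
          ordered = sorted(latencies_ms)
          rank = math.ceil(0.99 * len(ordered))  # nearest-rank index (1-based)
          return ordered[rank - 1]

      # Hypothetical per-pod latencies; the 4.18 run reported P99 Ready = 162000 ms
      sample = [29000, 41000, 118000, 132000, 162000]
      print(f"P99 Ready latency: {p99(sample) / 1000:.2f}s")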

      Version-Release number of selected component (if applicable):

      4.17.0-0.nightly-2025-03-03-235758 & 4.18.0-0.nightly-2025-03-03-231444

      How reproducible:

      Observed twice; more tests will be run to determine how reproducible it is (results are added in the Additional info section below).

      Steps to Reproduce:

      1. Install a GCP cluster with 3 masters, 24 workers, and 3 infras (m5.xlarge: 4 vCPU / 16 GiB memory)
      2. Run node-density-heavy-etcd-fips-24nodes test (4.17 test link & 4.18 test link)
      3. Check the metrics of P99 Ready latency after the test and compare 4.18 with 4.17

      To re-run the test via a prow rehearsal:

      1. Increase the timeout of this test in ocp-qe-perfscale-ci/openshift-qe-ocp-qe-perfscale-ci-main__gcp-4.18-nightly-x86.yaml#L45, as shown in the PR linked above.

      2. Run "make update" and "make jobs" in your terminal and push the changes.

      3. Trigger the prow job by commenting "/pj-rehearse pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-gcp-4.18-nightly-x86-node-density-heavy-etcd-fips-24nodes" on the PR. A sketch for pulling the reported P99 Ready latency out of the resulting build logs follows below.
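
      Once both rehearsal jobs finish, the reported P99 Ready latency can be extracted from the kube-burner build logs for a direct 4.17 vs. 4.18 comparison. A minimal sketch, assuming the logs have already been downloaded locally (the file paths below are hypothetical):

      # Minimal sketch: extract the reported P99 Ready latency from kube-burner
      # build logs. The file paths are placeholders for the downloaded logs.
      import re

      PATTERN = re.compile(r"P99 Ready latency \(([\d.]+)s\)")

      def p99_from_log(path):
          with open(path) as log:
              for line in log:
                  match = PATTERN.search(line)
                  if match:
                      return float(match.group(1))
          return None

      for version, path in [("4.17", "logs/4.17-node-density-heavy.log"),
                            ("4.18", "logs/4.18-node-density-heavy.log")]:
          print(f"{version}: P99 Ready latency {p99_from_log(path)}s")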

      Actual results:

      4.18 P99 Ready latency (162.00s) is longer than 4.17 (132.00s)

      Expected results:

      There is no regression in 4.18 P99 Ready latency

      Additional info:

      For the GCP-OVN-FIPS-EtcdEncrypt job runs:

      1 job for 4.17

      Mar 04 job

      time="2025-03-04 16:14:23" level=error msg="podLatency: P99 Ready latency (132.00s) higher than configured threshold: 2m0s" file="job.go:164"
      
      time="2025-03-04 16:22:55" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-dndsyry9-83979.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:121"
      time="2025-03-04 16:22:55" level=warning msg="🚨 alert at 2025-03-04T16:05:18Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-dndsyry9-83979-tt82n-master-0 higher than 10ms. 0.011s'" file="alert_manager.go:201"
      time="2025-03-04 16:22:55" level=warning msg="🚨 alert at 2025-03-04T16:09:18Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-dndsyry9-83979-tt82n-master-1 higher than 10ms. 0.01s'" file="alert_manager.go:201"

      4 jobs for 4.18

      Mar 04 job

      time="2025-03-04 16:15:55" level=error msg="podLatency: P99 Ready latency (162.00s) higher than configured threshold: 2m0s" file="job.go:164"
      
      time="2025-03-04 16:24:20" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-8s43clld-9b9f0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:121"
      time="2025-03-04 16:24:20" level=warning msg="🚨 alert at 2025-03-04T16:06:16Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-8s43clld-9b9f0-4n4pn-master-1 higher than 10ms. 0.012s'" file="alert_manager.go:201"
      time="2025-03-04 16:24:20" level=warning msg="🚨 alert at 2025-03-04T16:07:46Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-8s43clld-9b9f0-4n4pn-master-0 higher than 10ms. 0.01s'" file="alert_manager.go:201"
      time="2025-03-04 16:24:20" level=warning msg="🚨 alert at 2025-03-04T16:13:16Z: '10 minutes avg. 99th etcd commit latency on etcd-ci-op-8s43clld-9b9f0-4n4pn-master-0 higher than 30ms. 0.03s'" file="alert_manager.go:201"
      time="2025-03-04 16:24:20" level=warning msg="🚨 alert at 2025-03-04T16:06:16Z: '10 minutes avg. 99th mutating API call latency for DELETE/services higher than 1 second. 1.645s'" file="alert_manager.go:201"

      Dec 20 job

      time="2024-12-20 13:57:27" level=error msg="podLatency: P99 Ready latency (171.00s) higher than configured threshold: 2m0s" file="job.go:212"
      
      time="2024-12-20 14:09:17" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-1j9zm8qk-9b9f0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:135"
      time="2024-12-20 14:09:17" level=warning msg="🚨 alert at 2024-12-20T13:47:33Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-1j9zm8qk-9b9f0-khjs4-master-0 higher than 10ms. 0.012s'" file="alert_manager.go:217"
      time="2024-12-20 14:09:17" level=warning msg="🚨 alert at 2024-12-20T13:51:33Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-1j9zm8qk-9b9f0-khjs4-master-2 higher than 10ms. 0.01s'" file="alert_manager.go:217"
      time="2024-12-20 14:09:17" level=warning msg="🚨 alert at 2024-12-20T13:57:03Z: '10 minutes avg. 99th etcd commit latency on etcd-ci-op-1j9zm8qk-9b9f0-khjs4-master-0 higher than 30ms. 0.03s'" file="alert_manager.go:217"
      time="2024-12-20 14:09:17" level=warning msg="🚨 alert at 2024-12-20T13:56:33Z: '10 minutes avg. 99th mutating API call latency for POST/serviceaccounts higher than 1 second. 1.076s'" file="alert_manager.go:217"

      Dec 19 job

      time="2024-12-18 19:13:23" level=error msg="podLatency: P99 Ready latency (191.00s) higher than configured threshold: 2m0s" file="job.go:212"
      
      time="2024-12-18 19:24:07" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-txyqtv0b-9b9f0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:135"
      time="2024-12-18 19:24:08" level=warning msg="🚨 alert at 2024-12-18T19:03:41Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-txyqtv0b-9b9f0-9fvzx-master-0 higher than 10ms. 0.012s'" file="alert_manager.go:217"

      Dec 18 job

      time="2024-12-18 13:15:07" level=error msg="podLatency: P99 Ready latency (122.00s) higher than configured threshold: 2m0s" file="job.go:212"
      
      time="2024-12-18 13:23:46" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-
      op-1h6bgfsi-9b9f0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:135"
      time="2024-12-18 13:23:46" level=warning msg="🚨 alert at 2024-12-18T13:10:38Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-1h6bgfsi-9b9f0-6b66r-master-1 higher than 10ms. 0.01s'" file="alert_manager.go:217"
      time="2024-12-18 13:23:46" level=warning msg="🚨 alert at 2024-12-18T13:08:38Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-1h6bgfsi-9b9f0-6b66r-master-0 higher than 10ms. 0.01s'" file="alert_manager.go:217"
      

       

      If we check the two runs on Mar 04 of node-density-heavy on GCP without FIPS and etcd encryption, 4.18 (41s) is better than 4.17 (111s).

      4.17 job

      P99 Ready latency: 111s

      time="2025-03-04 11:27:18" level=info msg="node-density-heavy: Ready 99th: 111000 max: 122000 avg: 29041" file="pod_latency.go:252"
      time="2025-03-04 11:36:03" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-2spt8sfc-fbfc9.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:121"
      time="2025-03-04 11:36:03" level=warning msg="🚨 alert at 2025-03-04T11:20:34Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-2spt8sfc-fbfc9-vm5nk-master-2 higher than 10ms. 0.01s'" file="alert_manager.go:201"
      time="2025-03-04 11:36:04" level=warning msg="🚨 alert at 2025-03-04T11:25:04Z: 'ci-op-2spt8sfc-fbfc9-vm5nk-worker-a-5lrr9/kubelet down'" file="alert_manager.go:201"
      

      4.18 job

      P99 Ready latency: 41s

      time="2025-03-04 11:24:03" level=info msg="node-density-heavy: Ready 99th: 41000 max: 42000 avg: 18696" file="pod_latency.go:252"
      time="2025-03-04 11:42:33" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-09s5gqrf-834d1.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:121"
      time="2025-03-04 11:42:33" level=warning msg="🚨 alert at 2025-03-04T11:30:45Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-09s5gqrf-834d1-4m67d-master-0 higher than 10ms. 0.01s'" file="alert_manager.go:201"
      time="2025-03-04 11:42:33" level=warning msg="🚨 alert at 2025-03-04T11:26:45Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-09s5gqrf-834d1-4m67d-master-1 higher than 10ms. 0.016s'" file="alert_manager.go:201"
      time="2025-03-04 11:42:33" level=warning msg="🚨 alert at 2025-03-04T11:26:45Z: '10 minutes avg. 99th mutating API call latency for DELETE/services higher than 1 second. 1.094s'" file="alert_manager.go:201"

       

      After some analysis:

      1. The issue does not reproduce on AWS.
      2. The issue does not reproduce on GCP without FIPS-EtcdEncrypt, but the results there are not stable: 4.18 41s vs. 4.17 111s.
      3. On GCP-OVN-FIPS-EtcdEncrypt, the node-density-heavy result is worse than without FIPS-EtcdEncrypt.
      4. On GCP-OVN-FIPS-EtcdEncrypt, node-density-heavy on 4.18 is worse than 4.17 in the two runs so far; more samples are needed to confirm.
      5. On GCP-OVN-FIPS-EtcdEncrypt, node-density shows no regression, but the results are not stable (the majority sit at 3-4s, with two outliers at 8s and 9s), so a scheduler issue is unlikely.
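
      The fsync alerts above come from etcd's WAL fsync duration histogram. To collect more samples on the FIPS-EtcdEncrypt clusters, the in-cluster Prometheus (the prometheus-k8s route shown in the logs) can be queried directly for the p99 fsync latency during the test window. A minimal sketch; the route URL and token below are placeholders, not values from this bug:

      # Minimal sketch: query in-cluster Prometheus for p99 etcd WAL fsync latency.
      # PROM_URL and TOKEN are placeholders.
      import requests

      PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.<cluster-domain>"
      TOKEN = "<oauth-token>"
      QUERY = ("histogram_quantile(0.99, "
               "rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m]))")

      resp = requests.get(
          f"{PROM_URL}/api/v1/query",
          params={"query": QUERY},
          headers={"Authorization": f"Bearer {TOKEN}"},
          verify=False,  # CI clusters often use self-signed certificates
      )
      for result in resp.json()["data"]["result"]:
          pod = result["metric"].get("pod", "unknown")
          print(f"{pod}: {float(result['value'][1]) * 1000:.1f} ms")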

       

      Affected Platforms:

      Internal Red Hat testing failure.
