OCPBUGS-52338

[GCP-OVN-FIPS-EtcdEncrypt] 4.18 P99 Ready latency is longer than 4.17 in node-density-heavy-etcd-fips-24nodes test

      Description of problem:

      In the GCP-OVN-FIPS-EtcdEncrypt cluster's node-density-heavy test, the 4.18 P99 Ready latency (162.00s) is longer than 4.17's (132.00s).

      We are executing openshift-qe/ocp-qe-perfscale-ci/openshift-qe-ocp-qe-perfscale-ci-main__gcp-4.17-nightly-x86.yaml#L45 & openshift-qe/ocp-qe-perfscale-ci/openshift-qe-ocp-qe-perfscale-ci-main__gcp-4.18-nightly-x86.yaml#L45

      Reference: https://issues.redhat.com/browse/OCPQE-26730?focusedId=26709679&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-26709679 

      GitHub PR in which the test was executed: https://github.com/openshift/release/pull/59993 (you will find the prow job links in the comments)
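
      For context, the P99 Ready latency reported by kube-burner's podLatency measurement is the 99th percentile of the time from pod creation until the pod reports the Ready condition. A minimal sketch of how such a percentile is computed (nearest-rank method; the sample values below are made up purely for illustration and are not from these runs):

      # Minimal sketch (not kube-burner's actual implementation): compute the
      # P99 Ready latency from per-pod Ready latencies in milliseconds.
      import math

      def p99(latencies_ms):
          """Return the 99th percentile using the nearest-rank method."""
          ordered = sorted(latencies_ms)
          rank = math.ceil(0.99 * len(ordered))  # nearest-rank index (1-based)
          return ordered[rank - 1]

      # Hypothetical per-pod latencies; the 4.18 run reported P99 Ready = 162000 ms
      sample = [29000, 41000, 118000, 132000, 162000]
      print(f"P99 Ready latency: {p99(sample) / 1000:.2f}s")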

      Version-Release number of selected component (if applicable):

      4.17.0-0.nightly-2025-03-03-235758 & 4.18.0-0.nightly-2025-03-03-231444

      How reproducible:

      Observed twice; more tests will be run to determine how reproducible it is (results are added in the Additional info section below).

      Steps to Reproduce:

      1. Install a GCP cluster with 3 masters, 24 workers, and 3 infras (m5.xlarge: 4 vCPU / 16 GiB memory)
      2. Run node-density-heavy-etcd-fips-24nodes test (4.17 test link & 4.18 test link)
      3. Check the metrics of P99 Ready latency after the test and compare 4.18 with 4.17

      To re-run the test via a prow rehearsal:

      1. Increase the timeout of this test in ocp-qe-perfscale-ci/openshift-qe-ocp-qe-perfscale-ci-main__gcp-4.18-nightly-x86.yaml#L45, as shown in the PR linked above.

      2. Run "make update" and "make jobs" in your terminal and push the changes.

      3. Trigger the prow job by commenting "/pj-rehearse pull-ci-openshift-qe-ocp-qe-perfscale-ci-main-gcp-4.18-nightly-x86-node-density-heavy-etcd-fips-24nodes" on the PR. A sketch for pulling the reported P99 Ready latency out of the resulting build logs follows below.
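
      Once both rehearsal jobs finish, the reported P99 Ready latency can be extracted from the kube-burner build logs for a direct 4.17 vs. 4.18 comparison. A minimal sketch, assuming the logs have already been downloaded locally (the file paths below are hypothetical):

      # Minimal sketch: extract the reported P99 Ready latency from kube-burner
      # build logs. The file paths are placeholders for the downloaded logs.
      import re

      PATTERN = re.compile(r"P99 Ready latency \(([\d.]+)s\)")

      def p99_from_log(path):
          with open(path) as log:
              for line in log:
                  match = PATTERN.search(line)
                  if match:
                      return float(match.group(1))
          return None

      for version, path in [("4.17", "logs/4.17-node-density-heavy.log"),
                            ("4.18", "logs/4.18-node-density-heavy.log")]:
          print(f"{version}: P99 Ready latency {p99_from_log(path)}s")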

      Actual results:

      4.18 P99 Ready latency (162.00s) is longer than 4.17 (132.00s)

      Expected results:

      There is no regression in 4.18 P99 Ready latency

      Additional info:

      For the GCP-OVN-FIPS-EtcdEncrypt job runs:

      1 job for 4.17

      Mar 04 job

      time="2025-03-04 16:14:23" level=error msg="podLatency: P99 Ready latency (132.00s) higher than configured threshold: 2m0s" file="job.go:164"
      
      time="2025-03-04 16:22:55" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-dndsyry9-83979.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:121"
      time="2025-03-04 16:22:55" level=warning msg="🚨 alert at 2025-03-04T16:05:18Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-dndsyry9-83979-tt82n-master-0 higher than 10ms. 0.011s'" file="alert_manager.go:201"
      time="2025-03-04 16:22:55" level=warning msg="🚨 alert at 2025-03-04T16:09:18Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-dndsyry9-83979-tt82n-master-1 higher than 10ms. 0.01s'" file="alert_manager.go:201"

      4 jobs for 4.18

      Mar 04 job

      time="2025-03-04 16:15:55" level=error msg="podLatency: P99 Ready latency (162.00s) higher than configured threshold: 2m0s" file="job.go:164"
      
      time="2025-03-04 16:24:20" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-8s43clld-9b9f0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:121"
      time="2025-03-04 16:24:20" level=warning msg="🚨 alert at 2025-03-04T16:06:16Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-8s43clld-9b9f0-4n4pn-master-1 higher than 10ms. 0.012s'" file="alert_manager.go:201"
      time="2025-03-04 16:24:20" level=warning msg="🚨 alert at 2025-03-04T16:07:46Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-8s43clld-9b9f0-4n4pn-master-0 higher than 10ms. 0.01s'" file="alert_manager.go:201"
      time="2025-03-04 16:24:20" level=warning msg="🚨 alert at 2025-03-04T16:13:16Z: '10 minutes avg. 99th etcd commit latency on etcd-ci-op-8s43clld-9b9f0-4n4pn-master-0 higher than 30ms. 0.03s'" file="alert_manager.go:201"
      time="2025-03-04 16:24:20" level=warning msg="🚨 alert at 2025-03-04T16:06:16Z: '10 minutes avg. 99th mutating API call latency for DELETE/services higher than 1 second. 1.645s'" file="alert_manager.go:201"

      Dec 20 job

      time="2024-12-20 13:57:27" level=error msg="podLatency: P99 Ready latency (171.00s) higher than configured threshold: 2m0s" file="job.go:212"
      
      time="2024-12-20 14:09:17" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-1j9zm8qk-9b9f0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:135"
      time="2024-12-20 14:09:17" level=warning msg="🚨 alert at 2024-12-20T13:47:33Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-1j9zm8qk-9b9f0-khjs4-master-0 higher than 10ms. 0.012s'" file="alert_manager.go:217"
      time="2024-12-20 14:09:17" level=warning msg="🚨 alert at 2024-12-20T13:51:33Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-1j9zm8qk-9b9f0-khjs4-master-2 higher than 10ms. 0.01s'" file="alert_manager.go:217"
      time="2024-12-20 14:09:17" level=warning msg="🚨 alert at 2024-12-20T13:57:03Z: '10 minutes avg. 99th etcd commit latency on etcd-ci-op-1j9zm8qk-9b9f0-khjs4-master-0 higher than 30ms. 0.03s'" file="alert_manager.go:217"
      time="2024-12-20 14:09:17" level=warning msg="🚨 alert at 2024-12-20T13:56:33Z: '10 minutes avg. 99th mutating API call latency for POST/serviceaccounts higher than 1 second. 1.076s'" file="alert_manager.go:217"

      Dec 19 job

      time="2024-12-18 19:13:23" level=error msg="podLatency: P99 Ready latency (191.00s) higher than configured threshold: 2m0s" file="job.go:212"
      
      time="2024-12-18 19:24:07" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-txyqtv0b-9b9f0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:135"
      time="2024-12-18 19:24:08" level=warning msg="🚨 alert at 2024-12-18T19:03:41Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-txyqtv0b-9b9f0-9fvzx-master-0 higher than 10ms. 0.012s'" file="alert_manager.go:217"

      Dec 18 job

      time="2024-12-18 13:15:07" level=error msg="podLatency: P99 Ready latency (122.00s) higher than configured threshold: 2m0s" file="job.go:212"
      
      time="2024-12-18 13:23:46" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-
      op-1h6bgfsi-9b9f0.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:135"
      time="2024-12-18 13:23:46" level=warning msg="🚨 alert at 2024-12-18T13:10:38Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-1h6bgfsi-9b9f0-6b66r-master-1 higher than 10ms. 0.01s'" file="alert_manager.go:217"
      time="2024-12-18 13:23:46" level=warning msg="🚨 alert at 2024-12-18T13:08:38Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-1h6bgfsi-9b9f0-6b66r-master-0 higher than 10ms. 0.01s'" file="alert_manager.go:217"
      

       

      If we check the two runs on Mar 04 of node-density-heavy on GCP without FIPS and etcd encryption, 4.18 (41s) is better than 4.17 (111s).

      4.17 job

      P99 Ready latency: 111s

      time="2025-03-04 11:27:18" level=info msg="node-density-heavy: Ready 99th: 111000 max: 122000 avg: 29041" file="pod_latency.go:252"
      time="2025-03-04 11:36:03" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-2spt8sfc-fbfc9.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:121"
      time="2025-03-04 11:36:03" level=warning msg="🚨 alert at 2025-03-04T11:20:34Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-2spt8sfc-fbfc9-vm5nk-master-2 higher than 10ms. 0.01s'" file="alert_manager.go:201"
      time="2025-03-04 11:36:04" level=warning msg="🚨 alert at 2025-03-04T11:25:04Z: 'ci-op-2spt8sfc-fbfc9-vm5nk-worker-a-5lrr9/kubelet down'" file="alert_manager.go:201"
      

      4.18 job

      P99 Ready latency: 41s

      time="2025-03-04 11:24:03" level=info msg="node-density-heavy: Ready 99th: 41000 max: 42000 avg: 18696" file="pod_latency.go:252"
      time="2025-03-04 11:42:33" level=info msg="Evaluating alerts for job node-density-heavy in: https://prometheus-k8s-openshift-monitoring.apps.ci-op-09s5gqrf-834d1.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX" file="alert_manager.go:121"
      time="2025-03-04 11:42:33" level=warning msg="🚨 alert at 2025-03-04T11:30:45Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-09s5gqrf-834d1-4m67d-master-0 higher than 10ms. 0.01s'" file="alert_manager.go:201"
      time="2025-03-04 11:42:33" level=warning msg="🚨 alert at 2025-03-04T11:26:45Z: '10 minutes avg. 99th etcd fsync latency on etcd-ci-op-09s5gqrf-834d1-4m67d-master-1 higher than 10ms. 0.016s'" file="alert_manager.go:201"
      time="2025-03-04 11:42:33" level=warning msg="🚨 alert at 2025-03-04T11:26:45Z: '10 minutes avg. 99th mutating API call latency for DELETE/services higher than 1 second. 1.094s'" file="alert_manager.go:201"

       

      After some analysis:

      1. The issue does not reproduce on AWS.
      2. The issue does not reproduce on GCP without FIPS-EtcdEncrypt, but the results there are not stable: 4.18 41s vs. 4.17 111s.
      3. On GCP-OVN-FIPS-EtcdEncrypt, the node-density-heavy result is worse than without FIPS-EtcdEncrypt.
      4. On GCP-OVN-FIPS-EtcdEncrypt, node-density-heavy on 4.18 is worse than 4.17 in the two runs so far; more samples are needed to confirm.
      5. On GCP-OVN-FIPS-EtcdEncrypt, node-density shows no regression, but the results are not stable (the majority sit at 3-4s, with two outliers at 8s and 9s), so a scheduler issue is unlikely.
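
      The fsync alerts above come from etcd's WAL fsync duration histogram. To collect more samples on the FIPS-EtcdEncrypt clusters, the in-cluster Prometheus (the prometheus-k8s route shown in the logs) can be queried directly for the p99 fsync latency during the test window. A minimal sketch; the route URL and token below are placeholders, not values from this bug:

      # Minimal sketch: query in-cluster Prometheus for p99 etcd WAL fsync latency.
      # PROM_URL and TOKEN are placeholders.
      import requests

      PROM_URL = "https://prometheus-k8s-openshift-monitoring.apps.<cluster-domain>"
      TOKEN = "<oauth-token>"
      QUERY = ("histogram_quantile(0.99, "
               "rate(etcd_disk_wal_fsync_duration_seconds_bucket[10m]))")

      resp = requests.get(
          f"{PROM_URL}/api/v1/query",
          params={"query": QUERY},
          headers={"Authorization": f"Bearer {TOKEN}"},
          verify=False,  # CI clusters often use self-signed certificates
      )
      for result in resp.json()["data"]["result"]:
          pod = result["metric"].get("pod", "unknown")
          print(f"{pod}: {float(result['value'][1]) * 1000:.1f} ms")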

       

      Affected Platforms:

      Internal Red Hat testing failure.
