OCPBUGS-50522: 4.19 Perf Regression: cluster-density pod latency sees 4-second increase


      Description of problem:

      OCP 4.19 cluster-density-v2 podReadyLatencies regressed significantly.
      
      The 99th-percentile latency was stable at 11s and is now consistently above 15s, sometimes reaching ___
      
      Max latency fluctuated somewhat but was also stable; it now sometimes reaches 40s.
      
      The stats:
      Before the change: 99th percentile 11.064s +/- 0.235s, max 12.676s +/- 2.056s
      After the change: 99th percentile 15.650s +/- 0.489s, max 20.591s +/- 6.609s
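
      For scale, a quick back-of-the-envelope check (using only the numbers above) shows how far outside the pre-regression noise band the new measurements fall:

```python
# Rough effect-size check using the numbers quoted above (all values in seconds).
# "sigma" is the run-to-run standard deviation of the pre-regression measurements.
before_p99, sigma_p99 = 11.064, 0.235
after_p99 = 15.650

before_max, sigma_max = 12.676, 2.056
after_max = 20.591

# How many pre-regression standard deviations the new means sit above the old ones.
print(f"p99 shift: {(after_p99 - before_p99) / sigma_p99:.1f} sigma")  # ~19.5 sigma
print(f"max shift: {(after_max - before_max) / sigma_max:.1f} sigma")  # ~3.9 sigma
```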
      
      

      Version-Release number of selected component (if applicable):

      There are multiple change points across several nightlies. The Hunter change-point detection algorithm identifies (see the illustrative sketch after this list):
      * the max latency change was introduced between 4.19.0-0.nightly-2025-01-27-130640 and 4.19.0-0.nightly-2025-01-28-090833
      * the 99th-percentile latency change was introduced between 4.19.0-0.nightly-2025-01-28-090833 and 4.19.0-0.nightly-2025-01-30-091858
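
      For intuition only: Hunter detects change points statistically (E-divisive means, to my understanding). The sketch below is a crude mean-shift stand-in for that idea, not Hunter's implementation, and the per-nightly p99 values in it are hypothetical.

```python
import math
import statistics

def find_change_point(series):
    """Crude mean-shift detector: pick the split index that maximizes the
    difference of the left/right means, normalized by a pooled standard
    deviation. Illustrative stand-in for a real change-point algorithm."""
    best_idx, best_score = None, 0.0
    for i in range(2, len(series) - 1):
        left, right = series[:i], series[i:]
        pooled = math.sqrt(
            (len(left) * statistics.pvariance(left)
             + len(right) * statistics.pvariance(right)) / len(series)
        ) or 1e-9
        score = abs(statistics.mean(right) - statistics.mean(left)) / pooled
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx, best_score

# Hypothetical per-nightly p99 podReady latencies (seconds), oldest first.
p99_by_nightly = [11.1, 10.9, 11.2, 11.0, 11.3, 15.2, 15.8, 15.6, 16.1]
idx, score = find_change_point(p99_by_nightly)
print(f"change point between run {idx - 1} and run {idx}, score {score:.2f}")
```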
      
      

      How reproducible:

      100%. These values have remained consistently high and unstable through today (Feb 10).
      

      Steps to Reproduce:

      1. Run the payload control-plane test in prow: `/pj-rehearse periodic-ci-openshift-qe-ocp-qe-perfscale-ci-main-aws-4.19-nightly-x86-payload-control-plane-6nodes` (or observe the current job history triggered on each nightly build); a sketch for pulling the pod latency numbers out of the job artifacts follows below.
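
      Once the job completes, the pod-latency quantiles can be pulled from the kube-burner artifacts. The filename and field names in the sketch below are assumptions and may need adjusting to match the actual artifact layout.

```python
import json

# Hypothetical artifact path; adjust to the actual kube-burner output file
# collected from the prow job.
ARTIFACT = "podLatencyQuantilesMeasurement-cluster-density-v2.json"

with open(ARTIFACT) as f:
    docs = json.load(f)

# Report P99/max for the pod "Ready" condition; kube-burner typically records
# these latencies in milliseconds (an assumption worth verifying).
for doc in docs:
    if doc.get("quantileName") == "Ready":
        print(f"{doc.get('jobName', 'unknown')}: "
              f"P99={doc['P99'] / 1000:.2f}s max={doc['max'] / 1000:.2f}s")
```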
      

      Actual results:

      Observe that cluster-density-v2 p99 is >= 15s and max is between 17s and 40s.
      

      Expected results:

      cluster-density-v2 p99 is 11s and max is 12s
      

      Additional info:

      Our 11s 99th-percentile expectation is not an overly sensitive threshold. The increase from 11s to 15s indicates a significant reduction in throughput across the platform, comparable to running at a higher workload density or a higher client QPS scaling rate.
      
      We should treat this as a perceptible regression in user experience and cluster stability.
      
