  1. OpenShift Bugs
  2. OCPBUGS-37271

PSI causing latency issues


    • Bug
    • Resolution: Done-Errata
    • 4.16
    • Important
      * Previously, in a cluster that runs {product-title} 4.16 with the Telco RAN DU reference configuration, long-duration `cyclictest` or `timerlat` tests could fail with maximum latencies detected above `20`us. This issue occurred because the `psi` kernel command-line argument was set to `1` by default when cgroup v2 was enabled. With this release, the issue is fixed by setting `psi=0` in the kernel arguments when enabling cgroup v2. The `cyclictest` latency issue reported in link:https://issues.redhat.com/browse/OCPBUGS-34022[*OCPBUGS-34022*] is now also fixed. (link:https://issues.redhat.com/browse/OCPBUGS-37271[*OCPBUGS-37271*])
    • Bug Fix
    • Done

      Description of problem:

      While debugging recent cyclictest issues on OCP 4.16 (5.14.0-427.22.1.el9_4.x86_64+rt kernel), we discovered that the "psi=1" kernel cmdline argument, which is now added by default because cgroup v2 is enabled, causes latency issues: both cyclictest and timerlat fail to meet the latency KPIs we commit to for Telco RAN DU deployments. See RHEL-42737 for reference.
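To confirm whether a node is affected, one can inspect the kernel command line for the psi argument. The following is a minimal sketch, not part of the original report: the function names `psi_setting` and `running_kernel_psi` are hypothetical, and it assumes that when the argument appears more than once, the last occurrence is the one that takes effect.

```python
from pathlib import Path
from typing import Optional


def psi_setting(cmdline: str) -> Optional[str]:
    """Return the value of the psi= argument in a kernel command line.

    Returns None when no psi= argument is present. If the argument is
    repeated, the last value is returned (assumption: last one wins).
    """
    value = None
    for arg in cmdline.split():
        if arg.startswith("psi="):
            value = arg[len("psi="):]
    return value


def running_kernel_psi() -> Optional[str]:
    """Read /proc/cmdline on the current node and report its psi= value."""
    return psi_setting(Path("/proc/cmdline").read_text())
```

On an affected node, `running_kernel_psi()` would be expected to return `"1"`; after the fix it should return `"0"`.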
      

      Version-Release number of selected component (if applicable):

      OCP 4.16

      How reproducible:

      Cyclictest and timerlat consistently fail on long duration runs (e.g. 12 hours).

      Steps to Reproduce:

          1. Install OCP 4.16 and configure with the Telco RAN DU reference configuration.
          2. Run a long-duration cyclictest or timerlat test.

      Actual results:

      Maximum latencies are detected above 20us.

      Expected results:

      All latencies are below 20us.
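The pass/fail criterion above can be checked mechanically against cyclictest output. This is a rough sketch under stated assumptions: it expects the per-thread summary lines that cyclictest prints in its default (non-histogram) mode, each ending in a `Max:` field in microseconds, and it treats exactly 20us as a failure because the expected result says "below 20us". The function names and the sample output are illustrative, not taken from the report or from RHEL-42737.

```python
import re
from typing import List

# Matches the "Max:   <n>" field of a cyclictest per-thread summary line.
MAX_FIELD = re.compile(r"\bMax:\s*(\d+)")


def max_latencies_us(cyclictest_output: str) -> List[int]:
    """Collect the Max latency (in microseconds) reported for each thread."""
    return [int(m.group(1)) for m in MAX_FIELD.finditer(cyclictest_output)]


def meets_kpi(cyclictest_output: str, threshold_us: int = 20) -> bool:
    """Return True only if every thread's Max latency is below the threshold."""
    maxima = max_latencies_us(cyclictest_output)
    if not maxima:
        raise ValueError("no 'Max:' fields found in cyclictest output")
    return max(maxima) < threshold_us


# Hypothetical sample output, for illustration only.
SAMPLE = (
    "T: 0 ( 3045) P:95 I:1000 C:  43200 Min:      2 Act:    3 Avg:    3 Max:      14\n"
    "T: 1 ( 3046) P:95 I:1000 C:  43200 Min:      2 Act:    4 Avg:    3 Max:      23\n"
)
```

With the sample above, thread 1 reports a maximum of 23us, so `meets_kpi(SAMPLE)` is `False`.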

      Additional info:

      See RHEL-42737 for test results and debugging information. This was originally suspected to be an RHEL issue, but it turns out that PSI is being enabled by OpenShift code (which adds psi=1 to the kernel cmdline).
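The direction of the fix, setting psi=0 instead of psi=1 in the kernel arguments added when cgroup v2 is enabled, can be illustrated with a small sketch. This is not the OpenShift code: `with_psi_disabled` is a hypothetical helper, and the argument list in the usage note is an illustrative example only.

```python
from typing import List


def with_psi_disabled(kernel_args: List[str]) -> List[str]:
    """Return a copy of kernel_args with any psi setting replaced by psi=0."""
    # Drop every existing psi argument so only one setting remains.
    kept = [arg for arg in kernel_args if not arg.startswith("psi=")]
    return kept + ["psi=0"]
```

For example, `with_psi_disabled(["systemd.unified_cgroup_hierarchy=1", "psi=1"])` yields `["systemd.unified_cgroup_hierarchy=1", "psi=0"]`.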

              Peter Hunt (pehunt@redhat.com)
              Bart Wensley (bwensley@redhat.com)
              Sergio Regidor de la Rosa