OpenShift Bugs / OCPBUGS-17789

hwlatdetect latency on recent 4.12 builds has regressed compared with 4.10.45


    • Bug
    • Resolution: Done
    • Undefined
    • None
    • 4.12.z
    • Telco Performance
    • None
    • Important
    • No
    • False
      9/24: The fix is now in the OCP 4.12.61 release.
      7/25: Additional fix available and backported to 8.6.0.z (RHEL-40165). Need a retest when that is available in an OCP 4.12.z release.
      5/16: An additional fix is likely required - tracked under RHEL-36172
      3/19: Fix available - waiting for backport to 8.6.0.z (RHEL-28550)

      Description of problem:

      We run latency tests against "ProLiant DL110 Gen10 Plus" servers with "iLO Firmware Version 2.72 Sep 04 2022" in our pipelines. With our BIOS settings, the 4.10.45 build gives a max latency of 14us.
      
      hwlatdetect:  test duration 21600 seconds
         detector: tracer
         parameters:
              CPU list:          None
              Latency threshold: 8us
              Sample window:      10000000us
              Sample width:      950000us
           Non-sampling period:  9050000us
              Output File:       None

      Starting test
      test finished
      Max Latency: 14us
      Samples recorded: 78
      Samples exceeding threshold: 78
      ts: 1690494043.493348711, inner:14, outer:14
      ts: 1690494094.693333933, inner:14, outer:14
      ts: 1690494483.813099568, inner:14, outer:14
      ts: 1690494575.973091893, inner:0, outer:14
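
To compare runs like the one above across builds, the report can be reduced to a few numbers. This is only a minimal sketch, assuming the output layout shown in this ticket; `HwlatReport` and `parse_hwlatdetect` are hypothetical names and are not part of hwlatdetect or of our pipeline.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Patterns for the report lines seen in this ticket (layout is an assumption).
MAX_RE = re.compile(r"Max Latency:\s*(\d+)us")
RECORDED_RE = re.compile(r"Samples recorded:\s*(\d+)")
SAMPLE_RE = re.compile(r"ts:\s*(\d+\.\d+),\s*inner:(\d+),\s*outer:(\d+)")


@dataclass
class HwlatReport:
    max_latency_us: Optional[int] = None
    samples_recorded: Optional[int] = None
    # (timestamp, inner_us, outer_us); the timestamp stays a string so the
    # nanosecond digits are not rounded by a float conversion.
    samples: List[Tuple[str, int, int]] = field(default_factory=list)


def parse_hwlatdetect(text: str) -> HwlatReport:
    """Extract max latency, sample count and per-sample values from a report."""
    report = HwlatReport()
    for raw in text.splitlines():
        line = raw.strip()
        m = MAX_RE.search(line)
        if m:
            report.max_latency_us = int(m.group(1))
            continue
        m = RECORDED_RE.search(line)
        if m:
            report.samples_recorded = int(m.group(1))
            continue
        m = SAMPLE_RE.search(line)
        if m:
            report.samples.append((m.group(1), int(m.group(2)), int(m.group(3))))
    return report


# Excerpt copied from the 4.10.45 run above.
EXCERPT = """\
test finished
Max Latency: 14us
Samples recorded: 78
Samples exceeding threshold: 78
ts: 1690494043.493348711, inner:14, outer:14
ts: 1690494575.973091893, inner:0, outer:14
"""

report = parse_hwlatdetect(EXCERPT)
print(report.max_latency_us, report.samples_recorded, len(report.samples))
```

Keeping the per-sample `inner`/`outer` values makes it possible to see whether the maximum is a single outlier or a repeated value, as in the 4.10.45 run.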
      
      On the same physical server, we ran the same test against 4.12.25 builds but observed higher max latencies, between 27us and 44us.

      Thanks to Brent R for helping take a look.

      When he disabled the console driver, we observed results in line with the 4.10.45 baseline:
      
      hwlatdetect:  test duration 43200 seconds
         detector: tracer
         parameters:
              CPU list:          None
              Latency threshold: 8us
              Sample window:      10000000us
              Sample width:      950000us
           Non-sampling period:  9050000us
              Output File:       None

      Starting test
      test finished
      Max Latency: 12us
      Samples recorded: 9
      Samples exceeding threshold: 9
      ts: 1692019543.521718115, inner:2, outer:9
      ts: 1692022728.161717439, inner:9, outer:0
      ts: 1692024397.281718973, inner:2, outer:10
      ts: 1692033500.640716557, inner:2, outer:12
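
The sampling parameters are the same in both runs above, and they imply that the detector only measures for a small part of each window. A rough calculation, assuming exactly one sample of `width` per `window` for the whole test duration (the function name `sampling_summary` is hypothetical):

```python
def sampling_summary(duration_s: int, window_us: int, width_us: int) -> dict:
    """Derive the non-sampling period, duty cycle and total sampled time."""
    non_sampling_us = window_us - width_us            # idle part of each window
    duty_cycle = width_us / window_us                 # fraction of time sampled
    windows = (duration_s * 1_000_000) // window_us   # whole windows in the run
    sampled_s = windows * width_us / 1_000_000        # time spent sampling
    return {
        "non_sampling_us": non_sampling_us,
        "duty_cycle": duty_cycle,
        "windows": windows,
        "sampled_s": sampled_s,
    }


# Parameters of the 21600-second run on 4.10.45.
summary = sampling_summary(duration_s=21600, window_us=10_000_000, width_us=950_000)
print(summary)
```

With these parameters the non-sampling period comes out as 9050000us, matching the log, and the detector is sampling about 9.5% of the time, so a short run can miss an infrequent latency source.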
      
      However, after rebooting the server and running the test again, the max latency is above 20us.
       
      F0815 17:43:06.244852       1 main.go:53] failed to run hwlatdetect command; out: hwlatdetect:  test duration 1800 seconds
         detector: tracer
         parameters:
              Latency threshold: 20us
              Sample window:     10000000us
              Sample width:      950000us
           Non-sampling period:  9050000us
              Output File:       None
      
      Starting test
      test finished
      Max Latency: 27us
      Samples recorded: 1
      Samples exceeding threshold: 1
      ts: 1692120526.644392602, inner:27, outer:17
      
      Creating this ticket to track the issue.
      
      
      

      Version-Release number of selected component (if applicable):

       

      How reproducible:

      Load a recent 4.12 build on a ProLiant DL110 Gen10 Plus server and run the hwlatdetect test.
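
For anyone reproducing this outside the pipeline, the parameters of the failing 1800-second run can be turned into a command line. This is a sketch only: the flag names (`--duration`, `--threshold`, `--window`, `--width`) are what I understand the rt-tests `hwlatdetect` script to accept, so please check them against `hwlatdetect --help` on the node. `build_hwlatdetect_cmd` is a hypothetical helper, and the pipeline itself may invoke the tool differently.

```python
import shlex
from typing import List


def build_hwlatdetect_cmd(duration_s: int, threshold_us: int,
                          window_us: int = 10_000_000,
                          width_us: int = 950_000) -> List[str]:
    """Build an argv list; duration is in seconds, the rest in microseconds."""
    return [
        "hwlatdetect",
        f"--duration={duration_s}",
        f"--threshold={threshold_us}",
        f"--window={window_us}",
        f"--width={width_us}",
    ]


# Parameters taken from the failing run in this ticket.
cmd = build_hwlatdetect_cmd(duration_s=1800, threshold_us=20)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on the node; the tool normally needs root privileges.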

      Steps to Reproduce:

      1. Disable the console driver and run the test.
      2. Reboot the server (the console driver is running again) and run the test.
      

      Actual results:

      With the console driver running, the max latency is > 20us.

      Expected results:

      With the console driver running, the max latency is < 20us.
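
The three runs quoted in the description can be checked against this 20us expectation in a few lines. The max latency values are copied from the logs in this ticket; the run labels and the helper `classify_runs` are my own, and I am assuming the run with the console driver disabled was on the same recent 4.12 build.

```python
from typing import List, Tuple

LIMIT_US = 20  # expected upper bound from "Expected results"

# (label, max latency in microseconds) as reported in the description.
RUNS: List[Tuple[str, int]] = [
    ("4.10.45", 14),
    ("recent 4.12, console driver disabled", 12),
    ("recent 4.12, after reboot (console driver running)", 27),
]


def classify_runs(runs: List[Tuple[str, int]], limit_us: int) -> List[Tuple[str, int, str]]:
    """Mark each run as pass when its max latency is below the limit."""
    return [
        (label, max_us, "pass" if max_us < limit_us else "fail")
        for label, max_us in runs
    ]


for label, max_us, verdict in classify_runs(RUNS, LIMIT_US):
    print(f"{label}: {max_us}us -> {verdict}")
```

Only the run with the console driver active exceeds the limit, which is the behaviour this ticket tracks.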

      Additional info:

      HPE "ProLiant DL110 Gen10 Plus" servers with "iLO Firmware Version 2.72 Sep 04 2022"

              bwensley@redhat.com Bart Wensley
              jenchen@redhat.com Jennifer Chen
              Yang Liu
              Jennifer Chen