OCPBUGS-30795

cpu_util test prometheus queries intermittently return empty query results


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Undefined
    • Affects Version/s: 4.15, 4.15.z, 4.16
    • Component/s: Telco Performance
    • Severity: Moderate
    • Sprint: CNF Ran Sprint 252, CNF Ran Sprint 253, CNF Ran Sprint 254
    • 7/25: Still waiting for reproduction, but the current suspicion is that this is specific to the process-exporter, which is not a customer deliverable.

      Description of problem:

      When the cpu_util test runs, Prometheus sometimes starts returning empty query results. The behavior appears to be influenced by the workload size (percentage of isolated CPUs).
      
      On SPR-EE bare metal, an 85% workload seems to work fine.
      On Ice Lake bare metal, Prometheus often works properly, but not always, even at a much lower 40% workload.
      
      Once Prometheus starts returning empty results, the Prometheus pods must be restarted before subsequent queries return results again, and even then only for data gathered after the restart.
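
      As a reference, a minimal sketch of such a restart on the spoke, assuming the default openshift-monitoring namespace and the single prometheus-k8s-0 replica present on SNO:

          # Restart Prometheus on the spoke (on SNO there is typically only one replica).
          # The namespace and pod name are the OpenShift defaults and may differ elsewhere.
          oc -n openshift-monitoring delete pod prometheus-k8s-0
          # Wait for the replacement pod to become Ready before re-running queries.
          oc -n openshift-monitoring wait --for=condition=Ready pod/prometheus-k8s-0 --timeout=300s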
      
      Most of the time, the problem seems to start at the must-gather stage of the test suite.
      
      Symptom in the Jenkins job log: 
      
      2024/03/08 17:58:52 run command 'oc [adm must-gather]'
      2024/03/08 18:08:09 Command in prom pod: [bash -c curl "-s" 'http://localhost:9090/api/v1/query' --data-urlencode 'query=max_over_time((sum(namedprocess_namegroup_cpu_rate{groupname!~"conmon"})+sum(pod:container_cpu_usage:sum{pod!~"process-exp.*",pod!~"oslat.*",pod!~"stress.*",pod!~"cnfgotestpriv.*"}))[9m26s:30s])'; echo]
      2024/03/08 18:08:09 output: {"status":"success","data":{"resultType":"vector","result":[]}}
      2024/03/08 18:08:09 Query max over time total mgmt cpu usage
      2024/03/08 18:08:09 Must-gather dirs to be removed: [/var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/must-gather.local.4674322353555706431]
        [PANICKED] Test Panicked
        In [It] at: /usr/local/go/src/runtime/panic.go:113 @ 03/08/24 18:08:09.407
      
        runtime error: index out of range [0] with length 0
      
        Full Stack Trace
          gitlab.cee.redhat.com/cnf/cnf-gotests/test/ran/cpu/tests.checkCPUUsage(0x83fb46d7e8, 0x4, {0x3699cc0, 0xa})
          	/var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:200 +0x16af
          gitlab.cee.redhat.com/cnf/cnf-gotests/test/ran/cpu/tests.glob..func1.5.4.2()
          	/var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:156 +0x165
        < Exit [It] should use less than 2 core(s) - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:144 @ 03/08/24 18:08:09.408 (9m27.397s)
        > Enter [AfterEach] Management CPU utilization with workload pods running - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:115 @ 03/08/24 18:08:09.408
        < Exit [AfterEach] Management CPU utilization with workload pods running - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:115 @ 03/08/24 18:08:09.408 (0s)
        > Enter [ReportAfterEach] TOP-LEVEL - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/cpu_suite_test.go:66 @ 03/08/24 18:08:09.408
        < Exit [ReportAfterEach] TOP-LEVEL - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/cpu_suite_test.go:66 @ 03/08/24 18:08:09.408 (0s)
      • [PANICKED] [567.732 seconds]
      
      Empty results can be confirmed by running the query manually in the prometheus-k8s pod.
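
      For example, a manual check along the following lines reproduces the empty result vector (a sketch assuming the default openshift-monitoring namespace and prometheus-k8s-0 pod name; the query is the same one issued by the test):

          # Run the same instant query the test issues, directly against the Prometheus API.
          # Namespace, pod, and container names are the OpenShift defaults; adjust if needed.
          oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
            curl -s 'http://localhost:9090/api/v1/query' \
            --data-urlencode 'query=max_over_time((sum(namedprocess_namegroup_cpu_rate{groupname!~"conmon"})+sum(pod:container_cpu_usage:sum{pod!~"process-exp.*",pod!~"oslat.*",pod!~"stress.*",pod!~"cnfgotestpriv.*"}))[9m26s:30s])'
          # A healthy instance returns a non-empty "result" array; in the failing state the
          # response is {"status":"success","data":{"resultType":"vector","result":[]}}.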
      
          

      Version-Release number of selected component (if applicable):

          Seen at least as far back as 4.15.0-rc.2 and forward into current 4.16 nightlies.
          

      How reproducible:

          Always or often, depending on the workload percentage and the bare-metal hardware used.
          

      Steps to Reproduce:

          1. Deploy SNO with Telco DU profile
          2. Run cpu_util test 
          3. Observe the test logs for error conditions that occur after Prometheus on the spoke starts returning empty results (see the diagnostic sketch below).
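
          When the empty-result state is suspected, Prometheus on the spoke can be health-checked directly; the sketch below assumes the default openshift-monitoring namespace and prometheus-k8s-0 pod, and uses only standard Prometheus HTTP API endpoints:

              # Check whether scrape targets are still up and the TSDB is still ingesting samples.
              # Namespace, pod, and container names are the OpenShift defaults; adjust if needed.
              oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
                curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=up'
              oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
                curl -s 'http://localhost:9090/api/v1/status/tsdb'
              # An empty "result" array for 'up' alongside the symptom above suggests that
              # Prometheus has stopped returning data for all series, not just the cpu_util query.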
          

      Actual results:

      Prometheus queries start returning empty results, which prevents the test suite from gathering metrics.
          

      Expected results:

      Prometheus queries should continue to return results even with a large workload (~85% of isolated CPUs).
          

      Additional info:

      
          

              rh-ee-apalanis Abraham Miller
              rhn-support-dgonyier Dwaine Gonyier