-
Bug
-
Resolution: Unresolved
-
Undefined
-
None
-
4.15, 4.15.z, 4.16
-
Moderate
-
No
-
CNF Ran Sprint 252, CNF Ran Sprint 253, CNF Ran Sprint 254
-
3
-
False
-
-
-
Description of problem:
When the cpu_util test runs, Prometheus sometimes starts returning empty query results. This appears to be influenced by the workload size (percentage of isolated CPUs). For SPR-EE bm, an 85% workload seems to work fine. For Ice Lake bm, Prometheus often, but not always, works properly at a much lower 40% workload.

Once Prometheus starts returning empty results, the pods must be restarted before subsequent queries return results again, and then only once new data has been gathered since the restart. Most of the time, the problem seems to start with the must-gather stage of the test suite.

Symptom in the Jenkins job log:

2024/03/08 17:58:52 run command 'oc [adm must-gather]'
2024/03/08 18:08:09 Command in prom pod: [bash -c curl "-s" 'http://localhost:9090/api/v1/query' --data-urlencode 'query=max_over_time((sum(namedprocess_namegroup_cpu_rate{groupname!~"conmon"})+sum(pod:container_cpu_usage:sum{pod!~"process-exp.*",pod!~"oslat.*",pod!~"stress.*",pod!~"cnfgotestpriv.*"}))[9m26s:30s])'; echo]
2024/03/08 18:08:09 output: {"status":"success","data":{"resultType":"vector","result":[]}}
2024/03/08 18:08:09 Query max over time total mgmt cpu usage
2024/03/08 18:08:09 Must-gather dirs to be removed: [/var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/must-gather.local.4674322353555706431]
[PANICKED] Test Panicked
In [It] at: /usr/local/go/src/runtime/panic.go:113 @ 03/08/24 18:08:09.407
runtime error: index out of range [0] with length 0
Full Stack Trace
gitlab.cee.redhat.com/cnf/cnf-gotests/test/ran/cpu/tests.checkCPUUsage(0x83fb46d7e8, 0x4, {0x3699cc0, 0xa})
/var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:200 +0x16af
gitlab.cee.redhat.com/cnf/cnf-gotests/test/ran/cpu/tests.glob..func1.5.4.2()
/var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:156 +0x165
< Exit [It] should use less than 2 core(s) - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:144 @ 03/08/24 18:08:09.408 (9m27.397s)
> Enter [AfterEach] Management CPU utilization with workload pods running - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:115 @ 03/08/24 18:08:09.408
< Exit [AfterEach] Management CPU utilization with workload pods running - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:115 @ 03/08/24 18:08:09.408 (0s)
> Enter [ReportAfterEach] TOP-LEVEL - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/cpu_suite_test.go:66 @ 03/08/24 18:08:09.408
< Exit [ReportAfterEach] TOP-LEVEL - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/cpu_suite_test.go:66 @ 03/08/24 18:08:09.408 (0s)
• [PANICKED] [567.732 seconds]

The empty results can be confirmed by running the query manually in the prometheus-k8s pod.
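Separately from the Prometheus behavior, the panic itself (index out of range [0] with length 0) suggests the test indexes the first element of the result vector without checking its length. The following is only a minimal sketch of a guard, not the actual code in sno_cpu_utilization.go, which is not shown in this report. The names firstSampleValue, queryResponse, and ErrEmptyResult are hypothetical; the JSON shape follows the response seen in the log above.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strconv"
)

// ErrEmptyResult (hypothetical name) signals a successful query that
// returned no samples, as in the Jenkins log above.
var ErrEmptyResult = errors.New("prometheus query returned an empty result vector")

// queryResponse mirrors only the subset of the /api/v1/query response
// that this sketch needs.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			// Value is [ <unix timestamp>, "<sample value as a string>" ].
			Value [2]interface{} `json:"value"`
		} `json:"result"`
	} `json:"data"`
}

// firstSampleValue returns the value of the first sample in an instant
// vector, or ErrEmptyResult instead of panicking when the vector is empty.
func firstSampleValue(body []byte) (float64, error) {
	var resp queryResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return 0, fmt.Errorf("decode response: %w", err)
	}
	if resp.Status != "success" {
		return 0, fmt.Errorf("query status %q", resp.Status)
	}
	if len(resp.Data.Result) == 0 {
		return 0, ErrEmptyResult
	}
	raw, ok := resp.Data.Result[0].Value[1].(string)
	if !ok {
		return 0, fmt.Errorf("unexpected sample value type %T", resp.Data.Result[0].Value[1])
	}
	return strconv.ParseFloat(raw, 64)
}

func main() {
	// Response body copied from the job log in this report.
	empty := []byte(`{"status":"success","data":{"resultType":"vector","result":[]}}`)
	if _, err := firstSampleValue(empty); err != nil {
		fmt.Println("error:", err)
	}
}
```

With a guard like this the test would fail with a readable error rather than a panic, which would make the underlying empty-result condition easier to spot in the job log. It would not fix the empty results themselves.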
Version-Release number of selected component (if applicable):
At least as far back as 4.15.0-rc.2 and forward into the current 4.16 nightlies
How reproducible:
Always or often, depending on the workload percentage and the bm used
Steps to Reproduce:
1. Deploy SNO with Telco DU profile
2. Run cpu_util test
3. Observe test logs to monitor for error conditions that occur after prometheus on the spoke starts returning empty results.
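For step 3, one way to spot the condition in a saved job log is to scan for the empty result vector that appears in the query output. This is a rough sketch under the assumption that the log contains the literal text "result":[] as in the symptom quoted in this report; findEmptyResults and emptyResultMarker are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// emptyResultMarker is the substring seen in the job log when a query
// succeeds but returns no samples (assumed from the log in this report).
const emptyResultMarker = `"result":[]`

// findEmptyResults returns the 1-based line numbers of log lines that
// contain the empty-result marker. Note that bufio.Scanner uses a default
// maximum line length of 64 KiB; longer lines would stop the scan early.
func findEmptyResults(log string) []int {
	var hits []int
	scanner := bufio.NewScanner(strings.NewReader(log))
	line := 0
	for scanner.Scan() {
		line++
		if strings.Contains(scanner.Text(), emptyResultMarker) {
			hits = append(hits, line)
		}
	}
	return hits
}

func main() {
	// Two lines taken from the job log in this report.
	sample := "2024/03/08 17:58:52 run command 'oc [adm must-gather]'\n" +
		`2024/03/08 18:08:09 output: {"status":"success","data":{"resultType":"vector","result":[]}}` + "\n"
	fmt.Println("lines with empty results:", findEmptyResults(sample))
}
```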
Actual results:
Prometheus queries start returning empty results, which prevents the test suite from gathering metrics
Expected results:
Prometheus queries should always return results, even with a large workload of around 85%
Additional info: