OpenShift Bugs / OCPBUGS-30795

cpu_util test prometheus queries intermittently return empty query results


    • Bug
    • Resolution: Unresolved
    • Undefined
    • None
    • 4.15, 4.15.z, 4.16
    • Telco Performance
    • Moderate
    • No
    • CNF Ran Sprint 252, CNF Ran Sprint 253, CNF Ran Sprint 254
    • 3
    • False
    • None
    • 3/19: Still investigating, but the current suspicion is that this is specific to the process-exporter, which is not a customer deliverable.

      Description of problem:

      When the cpu_util test runs, Prometheus sometimes starts returning empty query results. The behavior appears to be influenced by the workload size (the percentage of isolated CPUs consumed by workload pods).
      
      On the SPR-EE bare-metal host, an 85% workload seems to work fine.
      On the Ice Lake bare-metal host, Prometheus often works properly, but not always, even at a much lower 40% workload.
      
      Once Prometheus starts returning empty results, the Prometheus pods must be restarted before subsequent queries return results, and even then only for data gathered after the restart.
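      
      As a recovery step, the Prometheus pods on the spoke can be restarted with something like the following (a minimal sketch, assuming the default openshift-monitoring deployment; the label selector is an assumption, not taken from the test suite):
      
        # Restart the Prometheus pods so that queries return data again
        oc -n openshift-monitoring delete pod -l app.kubernetes.io/name=prometheus
        # Wait for the replacement pod(s) to become ready before querying again
        oc -n openshift-monitoring wait --for=condition=Ready pod -l app.kubernetes.io/name=prometheus --timeout=5m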
      
      Most of the time, this seems to start during the must-gather stage of the test suite.
      
      Symptom in the Jenkins job log: 
      
      2024/03/08 17:58:52 run command 'oc [adm must-gather]'
      2024/03/08 18:08:09 Command in prom pod: [bash -c curl "-s" 'http://localhost:9090/api/v1/query' --data-urlencode 'query=max_over_time((sum(namedprocess_namegroup_cpu_rate{groupname!~"conmon"})+sum(pod:container_cpu_usage:sum{pod!~"process-exp.*",pod!~"oslat.*",pod!~"stress.*",pod!~"cnfgotestpriv.*"}))[9m26s:30s])'; echo]
      2024/03/08 18:08:09 output: {"status":"success","data":{"resultType":"vector","result":[]}}
      2024/03/08 18:08:09 Query max over time total mgmt cpu usage
      2024/03/08 18:08:09 Must-gather dirs to be removed: [/var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/must-gather.local.4674322353555706431]
        [PANICKED] Test Panicked
        In [It] at: /usr/local/go/src/runtime/panic.go:113 @ 03/08/24 18:08:09.407
      
        runtime error: index out of range [0] with length 0
      
        Full Stack Trace
          gitlab.cee.redhat.com/cnf/cnf-gotests/test/ran/cpu/tests.checkCPUUsage(0x83fb46d7e8, 0x4, {0x3699cc0, 0xa})
          	/var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:200 +0x16af
          gitlab.cee.redhat.com/cnf/cnf-gotests/test/ran/cpu/tests.glob..func1.5.4.2()
          	/var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:156 +0x165
        < Exit [It] should use less than 2 core(s) - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:144 @ 03/08/24 18:08:09.408 (9m27.397s)
        > Enter [AfterEach] Management CPU utilization with workload pods running - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:115 @ 03/08/24 18:08:09.408
        < Exit [AfterEach] Management CPU utilization with workload pods running - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/tests/sno_cpu_utilization.go:115 @ 03/08/24 18:08:09.408 (0s)
        > Enter [ReportAfterEach] TOP-LEVEL - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/cpu_suite_test.go:66 @ 03/08/24 18:08:09.408
        < Exit [ReportAfterEach] TOP-LEVEL - /var/lib/jenkins/workspace/ocp-far-edge-vran-tests/cnf-gotests/test/ran/cpu/cpu_suite_test.go:66 @ 03/08/24 18:08:09.408 (0s)
      • [PANICKED] [567.732 seconds]
      
      Empty results can be confirmed by running the query manually in the prometheus-k8s pod.
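      
      For example, the same query the test issues can be run by exec'ing into the pod (a minimal sketch; the pod name prometheus-k8s-0 and container name prometheus are assumptions based on a default single-node monitoring stack):
      
        oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
          curl -s 'http://localhost:9090/api/v1/query' \
          --data-urlencode 'query=max_over_time((sum(namedprocess_namegroup_cpu_rate{groupname!~"conmon"})+sum(pod:container_cpu_usage:sum{pod!~"process-exp.*",pod!~"oslat.*",pod!~"stress.*",pod!~"cnfgotestpriv.*"}))[9m26s:30s])'
        # A healthy response has entries in "result"; the failure mode is "result":[]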
      
          

      Version-Release number of selected component (if applicable):

          At least as far back as 4.15.0-rc.2, and forward into current 4.16 nightlies.
          

      How reproducible:

          Always or often, depending on the workload percentage and the bare-metal host used.
          

      Steps to Reproduce:

          1. Deploy SNO with the Telco DU profile.
          2. Run the cpu_util test.
          3. Observe the test logs for error conditions that occur after Prometheus on the spoke starts returning empty results (a monitoring sketch follows this list).
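          
          One way to watch for the onset of the failure while the test runs is to poll Prometheus in a loop (a minimal sketch; the pod/container names, the simplified query, and the 30 s interval are illustrative assumptions, not part of the test suite):
          
            while true; do
              out=$(oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
                curl -s 'http://localhost:9090/api/v1/query' \
                --data-urlencode 'query=sum(pod:container_cpu_usage:sum)')
              echo "$(date -u +%H:%M:%S) $out"
              # Flag the moment queries start coming back empty
              echo "$out" | grep -q '"result":\[\]' && echo ">>> Prometheus is returning empty results"
              sleep 30
            done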
          

      Actual results:

      Prometheus queries stop returning data, which prevents the test suite from gathering metrics.
          

      Expected results:

      Prometheus queries should continue to return results even with a large workload (~85% of isolated CPUs).
          

      Additional info:

      
          
