Story
Resolution: Done
Quality / Stability / Reliability
We have been relying on high-CPU alerts, but those only cover the masters. e2es appear to be pegging CPU on the workers too, where a lot of test workloads run, and tests subsequently fail. To correlate the two today you have to load up awkward promecieus links and find the right query to run, and because those links can't be shared for long, people usually end up attaching screenshots to bugs.
If the intervals charts could show the periods where CPU was over, say, 90 or 95% per node, it would become trivial to see that e2es were failing while CPU was pegged. It may also help correlate the tests that are causing it via interval overlap analysis (prior art here: https://github.com/openshift/origin/pull/29932; perhaps that should be modified to use the new intervals as part of this effort).
There is also prior art for generating intervals from PromQL queries; see https://github.com/openshift/origin/blob/b4420417d05dd2fd90c6ca7b2c7e50ef9f39296d/pkg/monitortests/kubeapiserver/apiunreachablefromclientmetrics/monitortest.go or perhaps https://github.com/openshift/origin/blob/b4420417d05dd2fd90c6ca7b2c7e50ef9f39296d/pkg/monitortests/testframework/metricsendpointdown/metricsendpointdown.go#L4
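A minimal sketch of the idea, using the prometheus client_golang API directly rather than origin's actual monitortest plumbing; the query, the 90% threshold, the 30s step, and the names (highCPUQuery, HighCPUInterval, collectHighCPUIntervals) are all illustrative assumptions, not existing code:

```go
package highcpu

import (
	"context"
	"time"

	promapi "github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// Candidate PromQL for per-node CPU utilization, derived from the node_exporter
// idle counter; the threshold comparison happens in code below. The 5m rate
// window is a placeholder, not a settled value.
const highCPUQuery = `1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))`

// HighCPUInterval is a stand-in for whatever interval type the charts consume.
type HighCPUInterval struct {
	Node string // value of the instance label (node address)
	From time.Time
	To   time.Time
}

// collectHighCPUIntervals runs the query over the job's time range and turns
// contiguous samples above the threshold into intervals, one set per node.
func collectHighCPUIntervals(ctx context.Context, promURL string, start, end time.Time) ([]HighCPUInterval, error) {
	client, err := promapi.NewClient(promapi.Config{Address: promURL})
	if err != nil {
		return nil, err
	}
	api := promv1.NewAPI(client)

	step := 30 * time.Second // sampling resolution for the range query
	result, _, err := api.QueryRange(ctx, highCPUQuery, promv1.Range{Start: start, End: end, Step: step})
	if err != nil {
		return nil, err
	}
	matrix, ok := result.(model.Matrix)
	if !ok {
		return nil, nil
	}

	const threshold = 0.90 // placeholder for "CPU is pegged"
	var intervals []HighCPUInterval
	for _, series := range matrix {
		node := string(series.Metric["instance"])
		var open *HighCPUInterval
		for _, sample := range series.Values {
			t := sample.Timestamp.Time()
			if float64(sample.Value) > threshold {
				if open == nil {
					open = &HighCPUInterval{Node: node, From: t}
				}
				open.To = t.Add(step) // extend through this sample's step
			} else if open != nil {
				intervals = append(intervals, *open)
				open = nil
			}
		}
		if open != nil {
			intervals = append(intervals, *open)
		}
	}
	return intervals, nil
}
```

The resulting intervals could then be rendered on the existing charts and overlapped against test intervals.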
Publishing an autodl artifact with some data on the time spent in a high-CPU state, broken out by workers and masters, would help us identify jobs that have the problem both now and in the future.
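For the artifact itself, a per-role summary would probably be enough to find affected jobs in aggregate queries; an illustrative shape (field names are made up, not an existing schema):

```go
// highCPUSummary is an illustrative shape for the autodl artifact: per node
// role, how much of the job was spent above the threshold.
type highCPUSummary struct {
	Role                string  `json:"role"`                   // "master" or "worker"
	NodesOverThreshold  int     `json:"nodesOverThreshold"`     // nodes that hit the threshold at all
	TotalHighCPUSeconds float64 `json:"totalHighCPUSeconds"`    // summed across the role's nodes
	LongestIntervalSecs float64 `json:"longestIntervalSeconds"` // worst single interval
}
```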