Story
Resolution: Done
Quality / Stability / Reliability
We have been relying on high-CPU alerts, but those only cover the masters. e2es appear to be pegging CPU on the workers too, where a lot of test workloads run, and tests subsequently fail. To correlate the two today you have to load up awkward promecieus links and find the right query to run, and because those links can't be shared for long, people usually end up attaching screenshots to bugs.
If the intervals charts could show the periods where CPU was over, say, 90 or 95% per node, it would become trivial to see that e2es were failing while CPU was pegged. It may also help correlate the tests that are causing it via interval overlap analysis (prior art here: https://github.com/openshift/origin/pull/29932; perhaps that should be modified to use the new intervals as part of this effort).
There is also prior art for generating intervals from PromQL queries; see https://github.com/openshift/origin/blob/b4420417d05dd2fd90c6ca7b2c7e50ef9f39296d/pkg/monitortests/kubeapiserver/apiunreachablefromclientmetrics/monitortest.go or perhaps https://github.com/openshift/origin/blob/b4420417d05dd2fd90c6ca7b2c7e50ef9f39296d/pkg/monitortests/testframework/metricsendpointdown/metricsendpointdown.go#L4
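A minimal sketch of the idea, using the prometheus client_golang API directly rather than origin's actual monitortest plumbing; the query, the 90% threshold, the 30s step, and the names (highCPUQuery, HighCPUInterval, collectHighCPUIntervals) are all illustrative assumptions, not existing code:

```go
package highcpu

import (
	"context"
	"time"

	promapi "github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

// Candidate PromQL for per-node CPU utilization, derived from the node_exporter
// idle counter; the threshold comparison happens in code below. The 5m rate
// window is a placeholder, not a settled value.
const highCPUQuery = `1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))`

// HighCPUInterval is a stand-in for whatever interval type the charts consume.
type HighCPUInterval struct {
	Node string // value of the instance label (node address)
	From time.Time
	To   time.Time
}

// collectHighCPUIntervals runs the query over the job's time range and turns
// contiguous samples above the threshold into intervals, one set per node.
func collectHighCPUIntervals(ctx context.Context, promURL string, start, end time.Time) ([]HighCPUInterval, error) {
	client, err := promapi.NewClient(promapi.Config{Address: promURL})
	if err != nil {
		return nil, err
	}
	api := promv1.NewAPI(client)

	step := 30 * time.Second // sampling resolution for the range query
	result, _, err := api.QueryRange(ctx, highCPUQuery, promv1.Range{Start: start, End: end, Step: step})
	if err != nil {
		return nil, err
	}
	matrix, ok := result.(model.Matrix)
	if !ok {
		return nil, nil
	}

	const threshold = 0.90 // placeholder for "CPU is pegged"
	var intervals []HighCPUInterval
	for _, series := range matrix {
		node := string(series.Metric["instance"])
		var open *HighCPUInterval
		for _, sample := range series.Values {
			t := sample.Timestamp.Time()
			if float64(sample.Value) > threshold {
				if open == nil {
					open = &HighCPUInterval{Node: node, From: t}
				}
				open.To = t.Add(step) // extend through this sample's step
			} else if open != nil {
				intervals = append(intervals, *open)
				open = nil
			}
		}
		if open != nil {
			intervals = append(intervals, *open)
		}
	}
	return intervals, nil
}
```

The resulting intervals could then be rendered on the existing charts and overlapped against test intervals.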
Publishing an autodl artifact with some data on the time spent in a high-CPU state, broken out by workers and masters, would help us identify jobs that have the problem both now and in the future.
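For the artifact itself, a per-role summary would probably be enough to find affected jobs in aggregate queries; an illustrative shape (field names are made up, not an existing schema):

```go
// highCPUSummary is an illustrative shape for the autodl artifact: per node
// role, how much of the job was spent above the threshold.
type highCPUSummary struct {
	Role                string  `json:"role"`                   // "master" or "worker"
	NodesOverThreshold  int     `json:"nodesOverThreshold"`     // nodes that hit the threshold at all
	TotalHighCPUSeconds float64 `json:"totalHighCPUSeconds"`    // summed across the role's nodes
	LongestIntervalSecs float64 `json:"longestIntervalSeconds"` // worst single interval
}
```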