  1. OCP Technical Release Team
  2. TRT-2260

Add intervals for high CPU


    • Story
    • Resolution: Done
    • Quality / Stability / Reliability

      We have been relying on high CPU alerts, but these only cover the masters. e2e tests appear to be pegging CPU on workers too, where a lot of test workloads run, and tests subsequently fail. To correlate the two, you need to load up awkward PromeCIeus links and find the right query to run, and those links can't be shared for long, so people usually end up attaching screenshots to bugs.

      If the intervals charts could show periods of time where CPU was over, say, 90 or 95% per node, it would become trivial to see that e2e tests were failing while CPU was pegged. It may also help with identifying the tests that cause it, via interval overlap analysis. (Prior art here: https://github.com/openshift/origin/pull/29932; perhaps this should be modified to use the new intervals as part of this effort.)
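To make the per-node threshold idea concrete, here is a rough sketch of collapsing CPU samples into high-CPU intervals. The `Sample` and `Interval` types, the `highCPUIntervals` helper, and the 30-second step are all hypothetical stand-ins for illustration; they are not origin's real interval types or API.

```go
package main

import (
	"fmt"
	"time"
)

// Sample is one CPU utilization reading for a node (hypothetical type).
type Sample struct {
	At      time.Time
	Percent float64
}

// Interval is a simplified stand-in for a monitor interval, not origin's real type.
type Interval struct {
	Node     string
	From, To time.Time
}

// demoStep is the assumed spacing between samples.
const demoStep = 30 * time.Second

// highCPUIntervals merges runs of samples at or above threshold into intervals.
// Samples must be sorted by time. Each high sample is treated as covering
// [At, At+step), so a gap longer than step ends the current interval.
func highCPUIntervals(node string, samples []Sample, threshold float64, step time.Duration) []Interval {
	var out []Interval
	var open *Interval
	for _, s := range samples {
		high := s.Percent >= threshold
		// Close the open interval on a low sample or on a gap in the data.
		if open != nil && (!high || s.At.After(open.To)) {
			out = append(out, *open)
			open = nil
		}
		if !high {
			continue
		}
		if open == nil {
			open = &Interval{Node: node, From: s.At}
		}
		open.To = s.At.Add(step)
	}
	if open != nil {
		out = append(out, *open)
	}
	return out
}

// demoSamples returns made-up readings: 50, 96, 97, 40, 95 percent.
func demoSamples() []Sample {
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	var samples []Sample
	for i, p := range []float64{50, 96, 97, 40, 95} {
		samples = append(samples, Sample{At: t0.Add(time.Duration(i) * demoStep), Percent: p})
	}
	return samples
}

func main() {
	for _, iv := range highCPUIntervals("worker-0", demoSamples(), 95, demoStep) {
		fmt.Printf("%s high CPU from %s to %s\n", iv.Node, iv.From.Format(time.RFC3339), iv.To.Format(time.RFC3339))
	}
}
```

With a 95% threshold the made-up readings yield two intervals for worker-0. In a real monitor test these would be converted to whatever interval type the charts consume.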

      There is also prior art for generating intervals based on PromQL queries; see https://github.com/openshift/origin/blob/b4420417d05dd2fd90c6ca7b2c7e50ef9f39296d/pkg/monitortests/kubeapiserver/apiunreachablefromclientmetrics/monitortest.go or perhaps https://github.com/openshift/origin/blob/b4420417d05dd2fd90c6ca7b2c7e50ef9f39296d/pkg/monitortests/testframework/metricsendpointdown/metricsendpointdown.go#L4
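As a sketch of the PromQL side, one plausible query is the common node_exporter idiom for non-idle CPU per instance. The query string, the `parseMatrix` helper, and the sample payload below are assumptions for illustration and are not taken from the linked monitor tests; only the shape of the `query_range` matrix response follows the Prometheus HTTP API.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"strconv"
	"time"
)

// nodeCPUQuery is an assumed query: percent of non-idle CPU per instance.
const nodeCPUQuery = `100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[2m])))`

// sampleResponse is a made-up query_range payload in the Prometheus matrix format.
const sampleResponse = `{"status":"success","data":{"resultType":"matrix","result":[
 {"metric":{"instance":"worker-0"},"values":[[1700000000,"42.0"],[1700000030,"97.5"]]},
 {"metric":{"instance":"master-0"},"values":[[1700000000,"12.25"]]}
]}}`

// rangeResponse mirrors the parts of the matrix response used here.
type rangeResponse struct {
	Status string `json:"status"`
	Data   struct {
		ResultType string `json:"resultType"`
		Result     []struct {
			Metric map[string]string `json:"metric"`
			Values [][2]interface{}  `json:"values"` // [unix seconds, "value"]
		} `json:"result"`
	} `json:"data"`
}

// Point is one parsed CPU reading (hypothetical type).
type Point struct {
	At      time.Time
	Percent float64
}

// parseMatrix groups the samples of a matrix response by the "instance" label.
func parseMatrix(body []byte) (map[string][]Point, error) {
	var resp rangeResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Status != "success" || resp.Data.ResultType != "matrix" {
		return nil, fmt.Errorf("unexpected response: status=%q resultType=%q", resp.Status, resp.Data.ResultType)
	}
	out := map[string][]Point{}
	for _, series := range resp.Data.Result {
		node := series.Metric["instance"]
		for _, v := range series.Values {
			ts, okTS := v[0].(float64)
			raw, okVal := v[1].(string)
			if !okTS || !okVal {
				return nil, fmt.Errorf("malformed sample for %q", node)
			}
			pct, err := strconv.ParseFloat(raw, 64)
			if err != nil {
				return nil, err
			}
			sec, frac := math.Modf(ts)
			out[node] = append(out[node], Point{At: time.Unix(int64(sec), int64(frac*1e9)).UTC(), Percent: pct})
		}
	}
	return out, nil
}

func main() {
	fmt.Println("query:", nodeCPUQuery)
	points, err := parseMatrix([]byte(sampleResponse))
	if err != nil {
		panic(err)
	}
	for _, node := range []string{"master-0", "worker-0"} {
		fmt.Printf("%s: %d samples\n", node, len(points[node]))
	}
}
```

The parsed points per node could then be compared against a 90 or 95% threshold to build the intervals. Mapping the `instance` label to a node name is left out here because it depends on how the cluster's metrics are labeled.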

      Publishing an autodl artifact with some data on the time spent in a high-CPU state, broken down by workers and masters, would help us identify jobs that have the problem, both now and in the future.
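For the artifact, the aggregation could look roughly like the sketch below, which sums high-CPU time per role and serializes it as JSON. The `HighCPUInterval` and `RoleSummary` types and the JSON field names are hypothetical; the real autodl artifact schema is not described in this ticket and would need to be followed instead.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// HighCPUInterval is a hypothetical record of one high-CPU period on a node.
type HighCPUInterval struct {
	Node     string
	Role     string // "master" or "worker"
	From, To time.Time
}

// RoleSummary is an assumed row shape, not the real autodl schema.
type RoleSummary struct {
	Role           string  `json:"role"`
	NodesAffected  int     `json:"nodes_affected"`
	HighCPUSeconds float64 `json:"high_cpu_seconds"`
}

// summarizeByRole totals high-CPU seconds and counts affected nodes per role,
// returning rows sorted by role name for stable output.
func summarizeByRole(intervals []HighCPUInterval) []RoleSummary {
	seconds := map[string]float64{}
	nodes := map[string]map[string]bool{}
	for _, iv := range intervals {
		seconds[iv.Role] += iv.To.Sub(iv.From).Seconds()
		if nodes[iv.Role] == nil {
			nodes[iv.Role] = map[string]bool{}
		}
		nodes[iv.Role][iv.Node] = true
	}
	roles := make([]string, 0, len(seconds))
	for role := range seconds {
		roles = append(roles, role)
	}
	sort.Strings(roles)
	out := make([]RoleSummary, 0, len(roles))
	for _, role := range roles {
		out = append(out, RoleSummary{Role: role, NodesAffected: len(nodes[role]), HighCPUSeconds: seconds[role]})
	}
	return out
}

// demoIntervals returns made-up intervals: 60s on one master, 90s and 60s on two workers.
func demoIntervals() []HighCPUInterval {
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	return []HighCPUInterval{
		{Node: "master-0", Role: "master", From: t0, To: t0.Add(60 * time.Second)},
		{Node: "worker-0", Role: "worker", From: t0, To: t0.Add(90 * time.Second)},
		{Node: "worker-1", Role: "worker", From: t0.Add(30 * time.Second), To: t0.Add(90 * time.Second)},
	}
}

func main() {
	data, err := json.MarshalIndent(summarizeByRole(demoIntervals()), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(data))
}
```

Keeping one row per role makes it easy to query for jobs where workers, rather than masters, spent a long time pegged.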

              kenzhang@redhat.com Ken Zhang
              rhn-engineering-dgoodwin Devan Goodwin