Type: Story
Resolution: Won't Do
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels: None
Activity Type: None
Blocked: False
Blocked Reason: None
Ready: False
Epic Link: None
Story Points: None
Target Version: None
Release Blocker: None
Sprint: None

As fallout from OCPBUGS-11591, we want to deploy an alert in CI clusters (not in the product at this point) for high CPU use on workers.

Ryan from the Node team has provided the following PromQL, which should give insight into the problems at pod/namespace granularity:

100 * (
  sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)
    / on(node)
  kube_node_status_capacity{resource="cpu"}
)

Or:

sum(rate(container_cpu_usage_seconds_total{}[5m])) by (pod)

First, experiment with these PromQL queries in PromeCIeus on past runs, perhaps the bad runs from the parent bug, and determine the threshold at which we think we should alert.
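One way to turn that experiment into numbers is to take the samples returned for the node-level query and measure how long a node stays above a candidate threshold. The sketch below is only an illustration: it assumes the samples are in the `[timestamp, "value"]` pair form that the Prometheus range-query API returns, and the function name, the threshold, and the sample data are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

# One sample as found in a Prometheus range-query "values" list:
# (unix timestamp in seconds, value encoded as a string).
Sample = Tuple[float, str]


def longest_breach_seconds(samples: Sequence[Sample], threshold: float) -> float:
    """Return the longest continuous stretch, in seconds, during which
    every consecutive sample is above `threshold`."""
    longest = 0.0
    start: Optional[float] = None  # timestamp of the first sample in the current breach
    for ts, raw in samples:
        if float(raw) > threshold:
            if start is None:
                start = ts
            longest = max(longest, ts - start)
        else:
            start = None  # breach ended; reset
    return longest


if __name__ == "__main__":
    # Made-up CPU percentages at one-minute intervals, for illustration only.
    fake_run = [(0, "50"), (60, "95"), (120, "97"), (180, "96"), (240, "40")]
    print(longest_breach_seconds(fake_run, threshold=90.0))
```

Comparing this figure between known-bad and known-good runs could suggest both a threshold and a minimum duration for the alert.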

Then we need to find a way to deploy the alert into CI clusters during IPI installs. This should apply broadly across the fleet.
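As a rough sketch of what such an alert could look like, the snippet below builds a `PrometheusRule` manifest around the node-level query above and prints it as JSON, which `oc apply -f -` accepts. Everything beyond the query itself is an assumption for illustration: the rule name, alert name, namespace, severity, threshold, and `for` duration are placeholders, not values decided in this issue. Note that the query as given covers all nodes, not only workers.

```python
import json

# Node-level CPU percentage query from the issue description.
NODE_CPU_PERCENT = (
    '100 * (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)'
    ' / on(node) kube_node_status_capacity{resource="cpu"})'
)


def build_rule(threshold: float, for_duration: str) -> dict:
    """Build a PrometheusRule manifest that fires when node CPU use stays
    above `threshold` percent for `for_duration` (e.g. "15m")."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": "ci-node-cpu",                # hypothetical name
            "namespace": "openshift-monitoring",  # assumed target namespace
        },
        "spec": {
            "groups": [{
                "name": "ci-node-cpu.rules",
                "rules": [{
                    "alert": "CINodeHighCPU",     # hypothetical alert name
                    "expr": f"{NODE_CPU_PERCENT} > {threshold}",
                    "for": for_duration,
                    "labels": {"severity": "warning"},
                    "annotations": {
                        "summary": "Node {{ $labels.node }} CPU use is above the CI threshold.",
                    },
                }],
            }],
        },
    }


if __name__ == "__main__":
    # Placeholder values; the real ones would come from the experiments on past runs.
    print(json.dumps(build_rule(threshold=90, for_duration="15m"), indent=2))
```

How the manifest gets applied during IPI installs (for example, as an extra install-time manifest or a post-install CI step) is left open here, as in the issue.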

Attachments:
screenshot-2.png (133 kB, 2023/06/07 3:04 PM)
screenshot-1.png (174 kB, 2023/06/07 3:03 PM)
image-2023-06-07-11-08-13-302.png (166 kB, 2023/06/07 3:08 PM)

Is related to: OCPBUGS-11591 Mass sig-network test failures on GCP OVN (Closed)

Assignee: Stephen Benjamin
Reporter: Devan Goodwin
Need Info From: None
Contributors: None
QA Contact: None
Doc Contact: None
Votes: 0
Watchers: 2
Created: 2023/05/09 4:06 PM
Updated: 2023/07/26 1:07 PM
Resolved: 2023/07/26 1:07 PM
