Type: Story
Resolution: Won't Do
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels: None
Blocked: False
Blocked Reason: None
Ready: False

As fallout from ~~OCPBUGS-11591~~, we want to deploy an alert in CI clusters (not in the product at this point) for high CPU usage on workers.

Ryan from the Node team has provided the following PromQL queries, which should give insight into the problems at pod/namespace granularity:

100 * (
  sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)
    / on(node)
  kube_node_status_capacity{resource="cpu"}
)

Or:

sum(rate(container_cpu_usage_seconds_total{}[5m])) by (pod)
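
As a rough illustration of what the first query computes, the sketch below reproduces the arithmetic in Python: a per-series CPU rate over the window, summed by node, divided by node capacity, and scaled to a percentage. The function name, the sample layout, and the node names are hypothetical, and the rate is a simplification: Prometheus `rate()` also handles counter resets and extrapolates to the window edges, which this sketch ignores.

```python
from collections import defaultdict


def node_cpu_percent(samples, capacity_cores, window_seconds):
    """Approximate per-node CPU utilization as a percentage.

    samples: iterable of (node, first_value, last_value), one entry per
             container series, holding the cumulative CPU-seconds counter
             at the start and end of the window (hypothetical layout).
    capacity_cores: dict mapping node name to its CPU capacity in cores.
    window_seconds: length of the window, e.g. 300 for [5m].
    """
    used_cores = defaultdict(float)
    for node, first, last in samples:
        # CPU-seconds consumed per wall-clock second equals cores in use.
        used_cores[node] += (last - first) / window_seconds
    return {
        node: 100.0 * cores / capacity_cores[node]
        for node, cores in used_cores.items()
    }


# Made-up example data: two containers on worker-a, one on worker-b.
samples = [
    ("worker-a", 1000.0, 1600.0),
    ("worker-a", 200.0, 500.0),
    ("worker-b", 50.0, 350.0),
]
capacity = {"worker-a": 4, "worker-b": 4}
result = node_cpu_percent(samples, capacity, window_seconds=300)
print(result)
```

With this made-up data, worker-a uses 3 of 4 cores over the window and worker-b uses 1 of 4.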

First, experiment with these PromQL queries in PromeCIeus on past runs, perhaps the bad runs from the parent bug, and determine the threshold at which we think we should alert.

Then we need to find a way to deploy an alert into CI clusters during IPI installs. This should apply globally across the fleet.
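
One possible shape for such an alert is a `PrometheusRule` object wrapping the per-node query. The sketch below builds that manifest as JSON so it could be applied by whatever install step is chosen. The rule name, alert name, namespace, severity, 90% threshold, and 10m duration are all placeholders (assumptions), since the ticket leaves the threshold and the deployment mechanism open. The query as given covers every node that reports the metric, not only workers.

```python
import json

# Per-node CPU utilization query from this ticket, as a single expression.
NODE_CPU_EXPR = (
    '100 * (sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)'
    ' / on(node) kube_node_status_capacity{resource="cpu"})'
)


def build_cpu_alert_rule(threshold_percent, for_duration, namespace):
    """Return a PrometheusRule manifest as a dict.

    All names and values passed in or hard-coded here are hypothetical
    placeholders, not decisions recorded in the ticket.
    """
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": "ci-high-node-cpu", "namespace": namespace},
        "spec": {
            "groups": [
                {
                    "name": "ci-node-cpu",
                    "rules": [
                        {
                            "alert": "CIHighNodeCPU",
                            "expr": f"{NODE_CPU_EXPR} > {threshold_percent}",
                            "for": for_duration,
                            "labels": {"severity": "warning"},
                            "annotations": {
                                "summary": "Node CPU usage is above the CI threshold."
                            },
                        }
                    ],
                }
            ]
        },
    }


# Placeholder values; the real threshold comes from the experiment on past runs.
manifest = build_cpu_alert_rule(90, "10m", "openshift-monitoring")
print(json.dumps(manifest, indent=2))
```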


Attachments:
image-2023-06-07-11-08-13-302.png (166 kB, 2023/06/07 3:08 PM)
screenshot-1.png (174 kB, 2023/06/07 3:03 PM)
screenshot-2.png (133 kB, 2023/06/07 3:04 PM)

is related to

OCPBUGS-11591 Mass sig-network test failures on GCP OVN

Closed

Assignee: Stephen Benjamin

Reporter: Devan Goodwin

Votes: 0

Watchers: 2

Created: 2023/05/09 4:06 PM

Updated: 2023/07/26 1:07 PM

Resolved: 2023/07/26 1:07 PM
