Type: Story
Resolution: Obsolete
Priority: Major
Fix Version/s: None
Affects Version/s: None
Labels:
None

Blocked:
False
Blocked Reason:
None
Ready:
False
Intelligence Requested:
Market:

SFDC Cases Links:
SFDC Cases Counter:
SFDC Cases Open:

In the fallout of ~~OCPBUGS-11591~~, we discussed a number of ways we could prevent this in the future, or at least see what component is consuming too much CPU, and when worker nodes are overcommitted. As far as we know, there are no alerts for worker CPU consumption.

Some of the Slack conversation:

Scott Dodson
2 hours ago
We've got existing alerts on control plane cpu utilization exceeding 60 and 90%, were those firing or was this only on workers?

dgoodwin
:working: 2 hours ago
iirc only workers

dgoodwin
:working: 2 hours ago
no alerts there?

Scott Dodson
2 hours ago
none that I know of

dgoodwin
:working: 2 hours ago
@Scott Dodson
do you think we could make a case for monitoring team to add those very soon, maybe as part of reaction plan to a failed 4.next sprint

dgoodwin
:working: 2 hours ago
that seems more logical, i guess why would the network operator need to be monitoring cpu when we could use metrics

dgoodwin
:working: 2 hours ago
although watching for that unreasonably long poll interval and putting up some kind of an alert or warning or operator status might be good

Scott Dodson
2 hours ago
I'm not sure, ideal state is sort of that you run as highly utilized as possible but never at 100%. Maybe we could ask the node team to help us understand if we could put together an alert that indicates when processes are waiting for cpu time over a decent window of time.
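Note: one possible source for a "processes waiting for CPU time over a window" signal is the Linux pressure stall information file /proc/pressure/cpu, which reports the share of time tasks were stalled on CPU averaged over 10, 60, and 300 seconds. This was not raised in the thread; it is an assumption about what such an alert could be built on. The function names and the 10% threshold below are hypothetical, and this is only a rough sketch of the parsing side.

```python
from typing import Dict

def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse the contents of a PSI file such as /proc/pressure/cpu.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result: Dict[str, Dict[str, float]] = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result

def cpu_pressure_sustained(text: str,
                           threshold_pct: float = 10.0,
                           window: str = "avg60") -> bool:
    """Return True if the 'some' stall percentage for the chosen window
    exceeds threshold_pct. The threshold is an illustrative value only."""
    some = parse_psi(text).get("some", {})
    return some.get(window, 0.0) > threshold_pct
```

On a node this would be fed the text read from /proc/pressure/cpu. Using the 60 or 300 second averages rather than the 10 second one keeps short bursts from tripping the check.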

trozet
2 hours ago
the OVS Unreasonably long poll interval... if you see in the parentheses it says 0 ms user, 0 ms system, then that means OVS cannot get CPU

trozet
2 hours ago
so it's one indicator
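Note: a rough sketch of how the indicator described above could be checked mechanically. It assumes the OVS warning has the form "Unreasonably long <N>ms poll interval (<U>ms user, <S>ms system)"; the exact log format should be verified against real ovs-vswitchd logs. The names and the 5% cutoff are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape of the OVS warning; whitespace before "ms" is tolerated.
POLL_RE = re.compile(
    r"Unreasonably long (\d+)\s*ms poll interval "
    r"\((\d+)\s*ms user, (\d+)\s*ms system\)"
)

class PollInterval(NamedTuple):
    wall_ms: int    # how long the poll interval lasted
    user_ms: int    # CPU time spent in user mode during it
    system_ms: int  # CPU time spent in kernel mode during it

def parse_poll_interval(line: str) -> Optional[PollInterval]:
    """Extract the timings from one log line, or None if it does not match."""
    match = POLL_RE.search(line)
    if match is None:
        return None
    return PollInterval(*(int(group) for group in match.groups()))

def looks_cpu_starved(interval: PollInterval,
                      max_cpu_fraction: float = 0.05) -> bool:
    """A long interval with almost no user or system time suggests the
    process was not scheduled, rather than busy. The cutoff is illustrative."""
    cpu_ms = interval.user_ms + interval.system_ms
    return cpu_ms <= max_cpu_fraction * interval.wall_ms
```

A line reporting 0 ms user and 0 ms system would be flagged, while a long interval where OVS itself burned most of the time would not.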

trozet
2 hours ago
also the load average on the system as a whole should not exceed the number of physical cores it has
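Note: a minimal sketch of the load-average rule of thumb above. One assumption to flag: Python's os.cpu_count() reports logical CPUs, which can be higher than the physical core count on hyperthreaded machines, so the comparison below is looser than the one described. The function names are hypothetical.

```python
import os

def load_per_core(load_avg: float, cores: int) -> float:
    """Normalize a load average by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores

def is_overcommitted(load_avg: float, cores: int,
                     threshold: float = 1.0) -> bool:
    """True when the load average exceeds threshold * cores."""
    return load_per_core(load_avg, cores) > threshold

def check_local_node() -> bool:
    """Sample the local machine using the 5-minute load average.

    os.cpu_count() returns logical CPUs, not physical cores (assumption
    noted above); os.getloadavg() is available on Unix-like systems.
    """
    _one, five, _fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    return is_overcommitted(five, cores)
```

Using the 5 or 15 minute average rather than the 1 minute one avoids reacting to brief spikes.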

dgoodwin
:working: 2 hours ago
that would be a great degraded = true condition wouldn't it

trozet
2 hours ago
those were the indicators we used to determine in the gcp upgrade case something was hogging CPU

trozet
2 hours ago
we tried to run top every 15 seconds and correlate that to when the cpu starvation happened, but i think it was flawed

trozet
2 hours ago
i think there are several tools out there that will run a daemon in the background and collect per process CPU then you can go back and review it later

Scott Dodson
2 hours ago
PAO/NTO I believe falls under node and they already manage stalld which seems like it may be useful. It's intended to boost processes starved of cpu time but it has a logging only function which it seems like we could hang a test off of, of course how much cpu time does stalld then consume?

is related to

OCPBUGS-11591 Mass sig-network test failures on GCP OVN

Closed

Assignee: Unassigned

Reporter: Devan Goodwin

Votes: 0

Watchers: 1

Created: 2023/04/25 5:22 PM

Updated: 2023/05/15 1:43 PM

Resolved: 2023/05/15 1:43 PM
