-
Story
-
Resolution: Obsolete
-
Major
-
-
In the fallout of OCPBUGS-11591, we discussed a number of ways we could prevent this in the future, or at least see what component is consuming too much CPU, and when worker nodes are overcommitted. As far as we know, there are no alerts for worker CPU consumption.
Some of the Slack conversation:
Scott Dodson
2 hours ago
We've got existing alerts on control plane cpu utilization exceeding 60 and 90%, were those firing or was this only on workers?
dgoodwin
:working: 2 hours ago
iirc only workers
dgoodwin
:working: 2 hours ago
no alerts there?
Scott Dodson
2 hours ago
none that I know of
dgoodwin
:working: 2 hours ago
@Scott Dodson
do you think we could make a case for monitoring team to add those very soon, maybe as part of reaction plan to a failed 4.next sprint
dgoodwin
:working: 2 hours ago
that seems more logical, i guess why would the network operator need to be monitoring cpu when we could use metrics
dgoodwin
:working: 2 hours ago
although watching for that unreasonably long poll interval and putting up some kind of an alert or warning or operator status might be good
Scott Dodson
2 hours ago
I'm not sure, ideal state is sort of that you run as highly utilized as possible but never at 100%. Maybe we could ask the node team to help us understand if we could put together an alert that indicates when processes are waiting for cpu time over a decent window of time.
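One possible way to measure "processes waiting for CPU time over a decent window" is the Linux pressure stall information (PSI) file `/proc/pressure/cpu`. This is a swapped-in technique, not something proposed in the thread, and it assumes the worker nodes run a kernel with PSI enabled. The function names, the 25% threshold, and the choice of the `avg300` window below are illustrative assumptions only. A minimal sketch:

```python
# Sketch only: PSI availability on the nodes, the threshold, and the window
# are assumptions for illustration, not values taken from the discussion.

def parse_pressure(text):
    """Parse /proc/pressure/cpu content into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=0"
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def cpu_pressure_high(text, threshold_pct=25.0, window="avg300"):
    """Return True if the 'some' stall percentage for the window exceeds the threshold."""
    some = parse_pressure(text).get("some", {})
    return some.get(window, 0.0) > threshold_pct


def read_cpu_pressure(path="/proc/pressure/cpu"):
    """Read the PSI file from a live host (requires a PSI-enabled kernel)."""
    with open(path) as f:
        return f.read()
```

On a node this could be polled as `cpu_pressure_high(read_cpu_pressure())`. The `some` line reports the share of time in which at least one runnable task was stalled waiting for CPU, which is close to the condition described above.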
trozet
2 hours ago
the OVS "Unreasonably long poll interval" message... if you look in the parentheses and it says 0 ms user, 0 ms system, then that means OVS cannot get CPU
trozet
2 hours ago
so it's one indicator
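The indicator described above could be checked mechanically by scanning OVS logs. The sketch below assumes the warning has the shape `Unreasonably long <N>ms poll interval (<U>ms user, <S>ms system)`; the regular expression, the function names, and the 10% cutoff are hypothetical choices for illustration, and the sample lines used with it are made up rather than taken from the incident.

```python
import re

# Assumed shape of the OVS warning; optional whitespace before "ms" is tolerated.
POLL_RE = re.compile(
    r"Unreasonably long (\d+)\s*ms poll interval "
    r"\((\d+)\s*ms user, (\d+)\s*ms system\)"
)


def parse_poll_warning(line):
    """Return wall/user/system milliseconds from a poll-interval warning, or None."""
    match = POLL_RE.search(line)
    if match is None:
        return None
    wall, user, system = (int(group) for group in match.groups())
    return {"wall_ms": wall, "user_ms": user, "system_ms": system}


def looks_cpu_starved(line, max_cpu_fraction=0.1):
    """True when a long poll interval used almost no CPU time.

    A long wall-clock interval with near-zero user and system time suggests
    the process was not scheduled, rather than being busy. The 10% cutoff is
    an arbitrary illustrative value.
    """
    parsed = parse_poll_warning(line)
    if parsed is None:
        return False
    cpu_ms = parsed["user_ms"] + parsed["system_ms"]
    return cpu_ms <= parsed["wall_ms"] * max_cpu_fraction
```

A line reporting a 5000 ms interval with 0 ms user and 0 ms system would be flagged, while one reporting 4000 ms user time would not, since in that case OVS was busy rather than starved.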
trozet
2 hours ago
also the load average on the system as a whole should not exceed the number of physical cores it has
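The load-average rule of thumb above can be sketched in a few lines. Note one assumption: `os.cpu_count()` returns logical CPUs, which may be higher than the number of physical cores on hosts with SMT, so the comparison here is looser than the rule as stated. The function names and the choice of the 5-minute average are illustrative.

```python
import os


def is_overloaded(load_avg, core_count):
    """Apply the rule of thumb: load average should not exceed the core count."""
    return load_avg > core_count


def check_host():
    """Compare the 5-minute load average with the CPU count of this host.

    Assumption: os.cpu_count() counts logical CPUs, not physical cores.
    os.getloadavg() is only available on Unix-like systems.
    """
    _one, five, _fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    return {"load_5m": five, "cores": cores, "overloaded": is_overloaded(five, cores)}
```

Calling `check_host()` on a node returns the sampled load, the CPU count, and whether the rule is violated, which could feed a warning or a degraded condition.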
dgoodwin
:working: 2 hours ago
that would be a great degraded = true condition wouldn't it
trozet
2 hours ago
those were the indicators we used to determine in the gcp upgrade case something was hogging CPU
trozet
2 hours ago
we tried to run top every 15 seconds and correlate that to when the cpu starvation happened, but i think it was flawed
trozet
2 hours ago
i think there are several tools out there that will run a daemon in the background and collect per process CPU then you can go back and review it later
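As a rough illustration of collecting per-process CPU for later review, the sketch below takes two snapshots of `/proc/<pid>/stat` and ranks processes by CPU ticks consumed in between. It is not one of the tools alluded to above and makes no claim to fix the sampling problems mentioned. All function names are hypothetical, and it assumes a Linux `/proc` filesystem.

```python
import os


def parse_stat(text):
    """Return (command name, utime + stime in clock ticks) from /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split on the first "(" and the last ")".
    lparen = text.index("(")
    rparen = text.rindex(")")
    comm = text[lparen + 1:rparen]
    rest = text[rparen + 1:].split()
    # rest[0] is field 3 (state), so utime (field 14) and stime (field 15)
    # sit at indexes 11 and 12.
    return comm, int(rest[11]) + int(rest[12])


def snapshot():
    """Collect {pid: (comm, cpu_ticks)} for every process currently visible."""
    snap = {}
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/stat") as f:
                snap[int(entry)] = parse_stat(f.read())
        except (FileNotFoundError, ProcessLookupError, PermissionError):
            continue  # process exited or is not readable
    return snap


def top_consumers(before, after, n=5):
    """Rank processes present in both snapshots by CPU ticks used in between.

    Processes that started after the first snapshot are skipped in this sketch.
    """
    deltas = []
    for pid, (comm, ticks) in after.items():
        if pid in before:
            deltas.append((ticks - before[pid][1], pid, comm))
    deltas.sort(reverse=True)
    return deltas[:n]
```

A collector could call `snapshot()`, sleep for the sampling interval, call it again, and log `top_consumers(before, after)` with a timestamp. Dividing the tick deltas by `os.sysconf("SC_CLK_TCK")` converts them to seconds of CPU time.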
Scott Dodson
2 hours ago
PAO/NTO I believe falls under node and they already manage stalld which seems like it may be useful. It's intended to boost processes starved of cpu time but it has a logging only function which it seems like we could hang a test off of, of course how much cpu time does stalld then consume?
- is related to
-
OCPBUGS-11591 Mass sig-network test failures on GCP OVN
- Closed