OCP Technical Release Team / TRT-986

Track down solutions for worker CPU alerts with node team


    • Type: Story
    • Resolution: Obsolete
    • Priority: Major

      In the fallout of OCPBUGS-11591, we discussed several ways to prevent a recurrence, or at least to see which component is consuming too much CPU and when worker nodes are overcommitted. As far as we know, there are no alerts for worker CPU consumption.

      Some of the Slack conversation:

      Scott Dodson
      2 hours ago
      We've got existing alerts on control plane cpu utilization exceeding 60 and 90%, were those firing or was this only on workers?

      dgoodwin
      2 hours ago
      iirc only workers

      dgoodwin
      2 hours ago
      no alerts there?

      Scott Dodson
      2 hours ago
      none that I know of

      dgoodwin
      2 hours ago
      @Scott Dodson do you think we could make a case for monitoring team to add those very soon, maybe as part of reaction plan to a failed 4.next sprint

      dgoodwin
      2 hours ago
      that seems more logical, i guess why would the network operator need to be monitoring cpu when we could use metrics

      dgoodwin
      2 hours ago
      although watching for that unreasonably long poll interval and putting up some kind of an alert or warning or operator status might be good

      Scott Dodson
      2 hours ago
      I'm not sure, ideal state is sort of that you run as highly utilized as possible but never at 100%. Maybe we could ask the node team to help us understand if we could put together an alert that indicates when processes are waiting for cpu time over a decent window of time.
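
One possible signal for "processes waiting for cpu time over a decent window" is the Linux kernel's pressure stall information (PSI), which nobody in the thread mentions; it is brought in here only as an assumption about how such an alert could be built. The sketch below parses text in the format of `/proc/pressure/cpu` and flags a node when the 60-second "some" average crosses a threshold. The function names, the sample text, and the 25% threshold are all hypothetical, and PSI must be enabled in the node's kernel for the file to exist.

```python
def parse_cpu_pressure(text: str) -> dict:
    """Parse /proc/pressure/cpu-style text into {kind: {field: float}}.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=0
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=") for field in fields)
        }
    return result


def cpu_starved(text: str, threshold_pct: float = 25.0, window: str = "avg60") -> bool:
    """Return True when tasks were stalled waiting for CPU for at least
    threshold_pct percent of the chosen window.

    threshold_pct is an arbitrary illustrative value, not a recommendation.
    """
    return parse_cpu_pressure(text)["some"][window] >= threshold_pct


# Made-up sample in the kernel's PSI format, for illustration only.
SAMPLE = "some avg10=41.20 avg60=33.75 avg300=12.10 total=123456789\n"
```

On a real node the input would come from reading `/proc/pressure/cpu`; whether the "some" line, the window, or the threshold is the right choice is exactly the kind of question the node team would need to answer.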

      trozet
      2 hours ago
      the OVS "Unreasonably long poll interval" message... if the parentheses say 0 ms user, 0 ms system, that means OVS cannot get CPU

      trozet
      2 hours ago
      so it's one indicator
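
The OVS indicator described above could be checked mechanically. The following is a minimal sketch, assuming the warning has the shape "Unreasonably long <N>ms poll interval (<U>ms user, <S>ms system)"; the regular expression, the function name, the 10% cutoff, and the sample log line are all hypothetical and should be checked against real ovs-vswitchd logs before use.

```python
import re

# Assumed shape of the OVS warning; whitespace before "ms" is tolerated.
POLL_RE = re.compile(
    r"Unreasonably long (?P<wall>\d+)\s*ms poll interval "
    r"\((?P<user>\d+)\s*ms user, (?P<system>\d+)\s*ms system\)"
)


def starved_poll(line: str, max_cpu_fraction: float = 0.1):
    """Classify one log line.

    Returns None if the line is not a long-poll warning, True if OVS spent
    at most max_cpu_fraction of the interval on CPU (it was likely waiting
    for CPU rather than working), and False otherwise.
    max_cpu_fraction is an arbitrary illustrative cutoff.
    """
    match = POLL_RE.search(line)
    if match is None:
        return None
    wall, user, system = (int(match[name]) for name in ("wall", "user", "system"))
    return (user + system) <= max_cpu_fraction * wall


# Made-up log line, for illustration only.
SAMPLE_LINE = (
    "2023-04-12T10:00:00.000Z|00042|timeval|WARN|"
    "Unreasonably long 2345ms poll interval (0ms user, 0ms system)"
)
```

A long interval with substantial user or system time points at OVS itself being busy, so only the near-zero case is treated as a starvation hint.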

      trozet
      2 hours ago
      also the load average on the system as a whole should not exceed the number of physical cores it has
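
The load-average rule of thumb can be written as a very small check. This is a sketch under stated assumptions: the function names are hypothetical, and `os.cpu_count()` reports logical CPUs rather than physical cores, so on hosts with SMT it is more lenient than the rule as stated in the thread.

```python
import os


def overloaded(load_avg: float, cores: int) -> bool:
    """True when the load average exceeds the number of cores."""
    return load_avg > cores


def check_host(window_index: int = 2) -> bool:
    """Apply the rule to the local host (Unix only).

    window_index selects the 1-, 5-, or 15-minute load average (0, 1, 2).
    os.cpu_count() counts logical CPUs, which is an approximation of the
    "physical cores" mentioned in the discussion.
    """
    cores = os.cpu_count() or 1
    return overloaded(os.getloadavg()[window_index], cores)
```

For example, `overloaded(9.5, 8)` is true while `overloaded(3.0, 8)` is not. On Linux the load average also counts tasks in uninterruptible sleep, so a high value alone does not prove CPU contention.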

      dgoodwin
      2 hours ago
      that would be a great degraded = true condition wouldn't it

      trozet
      2 hours ago
      those were the indicators we used to determine in the gcp upgrade case something was hogging CPU

      trozet
      2 hours ago
      we tried to run top every 15 seconds and correlate that to when the cpu starvation happened, but i think it was flawed

      trozet
      2 hours ago
      i think there are several tools out there that will run a daemon in the background and collect per-process CPU, then you can go back and review it later
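
The idea of collecting per-process CPU in the background can be sketched without naming any particular tool. The code below is an illustration only, not one of the tools alluded to: it snapshots cumulative CPU ticks from `/proc/<pid>/stat` and ranks processes by how many ticks they gained between two snapshots. All function names are hypothetical, and PID reuse between snapshots is ignored.

```python
import os


def parse_stat(stat_text: str):
    """Return (pid, comm, cpu_ticks) from the contents of /proc/<pid>/stat.

    comm may contain spaces or parentheses, so split on the last ')'.
    After it, utime and stime (fields 14 and 15) sit at indexes 11 and 12.
    """
    left, right = stat_text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    fields = right.split()
    return int(pid_str), comm, int(fields[11]) + int(fields[12])


def snapshot(proc: str = "/proc") -> dict:
    """Map pid -> (comm, cumulative CPU ticks) for every visible process."""
    snap = {}
    for name in os.listdir(proc):
        if not name.isdigit():
            continue
        try:
            with open(os.path.join(proc, name, "stat")) as handle:
                pid, comm, ticks = parse_stat(handle.read())
        except OSError:
            continue  # the process exited while we were reading
        snap[pid] = (comm, ticks)
    return snap


def top_consumers(before: dict, after: dict, n: int = 3) -> list:
    """Return up to n (tick_delta, pid, comm) tuples, largest delta first.

    Processes missing from `before` are treated as having started at 0.
    """
    deltas = []
    for pid, (comm, ticks) in after.items():
        previous = before.get(pid, (comm, 0))[1]
        deltas.append((ticks - previous, pid, comm))
    return sorted(deltas, reverse=True)[:n]
```

A background collector would call `snapshot()` on a fixed interval and persist the output of `top_consumers()` with a timestamp, so the data can be lined up with the starvation window afterwards. Because it diffs cumulative counters, short bursts between samples still show up in the totals, which a point-in-time view such as periodic `top` output can miss.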

      Scott Dodson
      2 hours ago
      PAO/NTO I believe falls under node, and they already manage stalld, which seems like it may be useful. It's intended to boost processes starved of cpu time, but it has a logging-only function which it seems like we could hang a test off of. Of course, how much cpu time does stalld then consume?

            Assignee: Unassigned
            Reporter: Devan Goodwin (rhn-engineering-dgoodwin)