Uploaded image for project: 'OCP Technical Release Team'
  1. OCP Technical Release Team
  2. TRT-1021

Deploy alert for high worker CPU in CI clusters

XMLWordPrintable

    • Icon: Story Story
    • Resolution: Won't Do
    • Icon: Major Major
    • None
    • None
    • None
    • False
    • None
    • False

      As fallout from OCPBUGS-11591 we want to deploy an alert in CI clusters (not in the product at this point) for high CPU use on workers.

      Ryan from Node team has provided the following promql which should give insight into pod/namespace granularity of the problems:

      100 * (
        sum(rate(container_cpu_usage_seconds_total{container!=""}[5m])) by (node)
          / on(node)
        kube_node_status_capacity{resource="cpu"}
      )
      

      Or:

      sum(rate(container_cpu_usage_seconds_total{}[5m])) by (pod)
      

      First experiment with these promql queries in promecius on past runs, perhaps the bad runs from the parent bug. Determine when we think we should alert.

      Then we need to find a way to deploy an alert into CI clusters during ipi installs, this should be quite global across the fleet.

        1. screenshot-2.png
          screenshot-2.png
          133 kB
        2. screenshot-1.png
          screenshot-1.png
          174 kB
        3. image-2023-06-07-11-08-13-302.png
          image-2023-06-07-11-08-13-302.png
          166 kB

              stbenjam Stephen Benjamin
              rhn-engineering-dgoodwin Devan Goodwin
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

                Created:
                Updated:
                Resolved: