OpenShift Monitoring / MON-2907

Improve KubeCPUOvercommit alert

    • Type: Story
    • Resolution: Unresolved
    • Priority: Normal

      trking pointed out that the KubeCPUOvercommit alert has a gap.

      https://coreos.slack.com/archives/C0VMT03S5/p1671424730231919

      The alert does not take recent additions to CPU capacity into account. As a result, well-timed load increases can keep the alert firing even when mitigation via autoscaling works just fine, which makes the alert noisy.

      The first proposal, extending the "for" clause, would fix only one case (two workload increases 5 minutes apart) and no others. Improving the alert expression is therefore preferable.

      To quote Trevor: "if we are within a node's worth of CPU for 10m and the total CPU capacity hasn't increased over that 10m" would be a better trigger.
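
      To make the quoted condition concrete, here is a minimal Python sketch of the trigger logic. It is not the alert expression itself, and the names (Sample, overcommitted, should_fire) are hypothetical. It assumes "within a node's worth of CPU" means that the headroom (total capacity minus total requests) is smaller than the largest node's CPU, and it treats "capacity hasn't increased" as comparing the newest sample in the 10m window against the oldest.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One observation of cluster-wide CPU figures, in cores (hypothetical shape)."""
    requested: float     # total CPU requested by pods
    capacity: float      # total allocatable CPU across all nodes
    largest_node: float  # allocatable CPU of the largest single node


def overcommitted(s: Sample) -> bool:
    # Assumed reading of "within a node's worth of CPU":
    # the remaining headroom is smaller than one (largest) node.
    return s.capacity - s.requested < s.largest_node


def should_fire(window: Sequence[Sample]) -> bool:
    """Decide whether to fire, given samples for the last 10 minutes, oldest first."""
    if not window:
        return False
    # Condition 1: overcommitted for the whole window.
    within_node_all_window = all(overcommitted(s) for s in window)
    # Condition 2: capacity did not grow over the window
    # (growth suggests autoscaling is already reacting).
    capacity_grew = window[-1].capacity > window[0].capacity
    return within_node_all_window and not capacity_grew


# Steady overcommit with no new capacity: the alert should fire.
steady = [Sample(28, 32, 8), Sample(28, 32, 8), Sample(28, 32, 8)]
# Load and capacity both grow: still tight, but autoscaling is working.
scaling = [Sample(28, 32, 8), Sample(34, 40, 8), Sample(36, 40, 8)]

print(should_fire(steady), should_fire(scaling))
```

      In the second window the cluster never has a full node of headroom, but capacity grew from 32 to 40 cores, so the sketch stays silent. This is the well-timed load increase case described above.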

              Ayoub Mrini (rh-ee-amrini)
              Jan Fajerski (jfajersk@redhat.com)
              Junqi Zhao
              Votes: 0
              Watchers: 5
