OpenShift Monitoring / MON-2907

Improve KubeCPUOvercommit alert


    • Type: Story
    • Resolution: Unresolved
    • Priority: Normal

      trking pointed out that the KubeCPUOvercommit alert has a gap.

      https://coreos.slack.com/archives/C0VMT03S5/p1671424730231919

      The alert does not take recent additions to CPU capacity into account. As a result, well-timed load increases can keep the alert firing even when mitigation via autoscaling is working fine.
      This makes the alert noisy.

      The first proposal, extending the for clause, would fix only one case (two workload increases 5 minutes apart) and no others. Improving the alert expression is therefore preferable.

      To quote Trevor: "if we are within a node's worth of CPU for 10m and the total CPU capacity hasn't increased over that 10m" would be a better trigger.
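
To make the quoted condition concrete, here is a minimal Python sketch of the decision logic. It is not the alert rule itself and not a proposed PromQL expression. The `Sample` type, its field names, and the one-sample-per-minute layout are assumptions made for illustration. "Within a node's worth of CPU" is read here as headroom (total capacity minus requested CPU) being smaller than the largest node.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One observation of cluster CPU state (hypothetical shape)."""
    minute: int            # timestamp in minutes
    requested_cpu: float   # sum of CPU requests across pods
    total_capacity: float  # sum of allocatable CPU across nodes
    largest_node: float    # allocatable CPU of the largest node


def within_one_node(s: Sample) -> bool:
    """True if the spare CPU is less than one (largest) node's worth."""
    return s.total_capacity - s.requested_cpu < s.largest_node


def should_fire(samples: Sequence[Sample], window_minutes: int = 10) -> bool:
    """Fire only if the cluster stayed within a node's worth of CPU for the
    whole window AND total capacity did not increase during that window.

    Assumes `samples` is sorted by `minute` in ascending order.
    """
    if not samples:
        return False
    latest = samples[-1].minute
    window = [s for s in samples if latest - s.minute <= window_minutes]
    # Require data covering the full window before firing.
    if latest - window[0].minute < window_minutes:
        return False
    if not all(within_one_node(s) for s in window):
        return False
    # Any capacity increase inside the window means autoscaling is reacting.
    capacity_grew = any(
        later.total_capacity > earlier.total_capacity
        for earlier, later in zip(window, window[1:])
    )
    return not capacity_grew


# Steady overcommit with no scaling: the alert should fire.
steady = [Sample(m, 10.0, 12.0, 4.0) for m in range(11)]

# Load and capacity both step up at minute 5: headroom stays below one node,
# but capacity grew, so the alert stays quiet.
scaled = [
    Sample(m, 10.0 if m < 5 else 14.0, 12.0 if m < 5 else 16.0, 4.0)
    for m in range(11)
]
```

In the `scaled` series the headroom never exceeds 2 CPUs, so a check on headroom alone would stay true for the full 10 minutes. The extra capacity condition is what suppresses the alert in that case.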

            rh-ee-amrini Ayoub Mrini
            jfajersk@redhat.com Jan Fajerski
            Junqi Zhao