  1. OpenShift Bugs
  2. OCPBUGS-62968

[4.19 Backport] Improve KubeAggregatedAPIErrors alert in high availability scenarios


    • Bug
    • Resolution: Unresolved
    • Normal
    • 4.19.z
    • 4.10.z
    • Monitoring
    • Quality / Stability / Reliability
    • False
    • 3
    • MON Sprint 278
    • 1
    • In Progress
    • Bug Fix
    • False alarms reduced for aggregated API errors::
      Before this update, the `KubeAggregatedAPIErrors` alert was based on the sum of errors across all instances of an API. As a consequence, the alert became more likely to fire as the number of instances grew. With this release, the alert is evaluated at the instance level rather than the API level. As a result, false alarms are reduced, because the error threshold is no longer reached sooner simply because an API runs more instances.
      +
      link:https://issues.redhat.com/browse/OCPBUGS-60691[OCPBUGS-60691]

      Description of problem:

      KubeAggregatedAPIErrors alerted on the total number of errors
      returned by all instances of an aggregated API, which made it more
      likely to fire the more instances the API had. To fix this, the alert
      needs to be evaluated at the instance level rather than the API level,
      by changing the aggregation function from a sum to a max.
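
To illustrate why the aggregation function matters, here is a minimal Python sketch that compares the two evaluations on made-up per-instance error counts. It is not the actual alert expression: the threshold value, the instance names, and both helper functions are hypothetical and chosen only for illustration.

```python
# Hypothetical threshold; the real alert rule may use a different value.
ERROR_THRESHOLD = 4


def fires_with_sum(errors_per_instance, threshold=ERROR_THRESHOLD):
    """API-level evaluation: errors from all instances are added together."""
    return sum(errors_per_instance.values()) > threshold


def fires_with_max(errors_per_instance, threshold=ERROR_THRESHOLD):
    """Instance-level evaluation: only the worst single instance is compared."""
    return max(errors_per_instance.values(), default=0) > threshold


# Made-up data: three instances with a few errors each, none above the threshold.
three_instances = {"instance-a": 2, "instance-b": 2, "instance-c": 2}

# Made-up data: one instance that is genuinely above the threshold.
one_bad_instance = {"instance-a": 6, "instance-b": 0, "instance-c": 0}

print(fires_with_sum(three_instances), fires_with_max(three_instances))
print(fires_with_sum(one_bad_instance), fires_with_max(one_bad_instance))
```

With the sum, adding instances that each report a small number of errors is enough to cross the threshold. With the max, the alert fires only when at least one instance crosses it on its own.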

      Additional info:

      Setting the affected version to 4.10.z, since that is as far back as the available versions go, but the issue first surfaced in 4.1 (February 2020): https://github.com/dgrisonnet/kubernetes-mixin/commit/966ce6f2a8ce7ceedad32a68e991d13d4ee8474e

              prasriva@redhat.com Pranshu Srivastava
              prasriva@redhat.com Pranshu Srivastava
              Junqi Zhao
              Eliska Romanova