Observability Documentation / OBSDOCS-185

unsuitable repeat_interval samples in alertmanager


Details

    • OBSDOCS (May 30-June 20) #237, OBSDOCS (June 20-July 10) #238

    Description

      Description of problem:

      https://docs.openshift.com/container-platform/4.11/monitoring/managing-alerts.html#applying-custom-alertmanager-configuration_managing-alerts
      https://access.redhat.com/documentation/en-us/openshift_container_platform/4.11/html/monitoring/managing-alerts#applying-custom-alertmanager-configuration_managing-alerts

      A sample "Applying a custom Alertmanager configuration" looks like this:

      alertmanager.yaml
      ```
      global:
        resolve_timeout: 5m
      route:
        group_wait: 30s
        group_interval: 5m         # <== here
        repeat_interval: 12h
        receiver: default
        routes:
        - match:
            alertname: Watchdog
          repeat_interval: 5m      # <== here
          receiver: watchdog
      # ... snip ...
      ```

       In this sample, the Watchdog route's repeat_interval has the same value as group_interval, a combination that is known to cause a race condition. As a result, the alert is actually sent at random intervals of either 5 or 10 minutes, which creates unnecessary confusion for customers. Please update the sample to avoid the race condition.
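
The race described above can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not Alertmanager's actual code: it assumes the alert group is flushed once per group_interval with a small timing jitter, and that a still-firing alert is re-sent at a flush only if at least repeat_interval has elapsed since the previous send. The function name `send_times_ms` and the jitter values are hypothetical and chosen only for illustration.

```python
from typing import List, Optional


def send_times_ms(
    group_interval_ms: int,
    repeat_interval_ms: int,
    jitter_ms: List[int],
) -> List[int]:
    """Return the times (in ms) at which a notification is sent.

    Simplified model (an assumption, not Alertmanager's implementation):
    flush k happens at k * group_interval_ms + jitter_ms[k], and the alert
    is re-sent only if repeat_interval_ms has fully elapsed since the
    previous send.
    """
    sent: List[int] = []
    last_sent: Optional[int] = None
    for k, jitter in enumerate(jitter_ms):
        flush_at = k * group_interval_ms + jitter
        if last_sent is None or flush_at - last_sent >= repeat_interval_ms:
            sent.append(flush_at)
            last_sent = flush_at
        # Otherwise the flush is skipped and the alert waits a full
        # group_interval for the next chance to be sent.
    return sent


if __name__ == "__main__":
    five_min = 300_000  # 5m in milliseconds, as in the sample above

    # No jitter: every flush sends, exactly 5 minutes apart.
    print(send_times_ms(five_min, five_min, [0, 0, 0, 0, 0]))
    # [0, 300000, 600000, 900000, 1200000]

    # Hypothetical jitter of a few hundred ms: some flushes arrive just
    # under 5 minutes after the previous send and are skipped.
    print(send_times_ms(five_min, five_min, [0, 100, 0, 300, 100]))
    # [0, 300100, 900300]
```

In this model, the second run produces gaps of about 5 minutes and then about 10 minutes, which is consistent with the behavior reported here. Whether real deployments follow exactly this timing is not confirmed by this sketch.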

      Version-Release number of selected component (if applicable):

      4.11

      Actual results:

        - match:
            alertname: Watchdog
          repeat_interval: 5m
          receiver: watchdog 

      This sample sends the alert at random intervals of either 5 or 10 minutes.

      Expected results:

      A sample that reliably sends the alert every 5 minutes.

      From our investigation and customer feedback, we know that neither of the following patterns triggers the alert exactly every 5 minutes:

        - match:
            alertname: Watchdog
          repeat_interval: 1h
          receiver: watchdog 

      and

        - match:
            alertname: Watchdog
          group_wait: 30s
          group_interval: 1m
          repeat_interval: 5m
          receiver: watchdog 

      • If this is an implementation-based limitation, there should at least be a warning in the documentation.

      Additional info:

      QE needs to verify that the race condition does not occur.
      

       

            People

              rhn-support-bburt Brian Burt
              rhn-support-kahara Kazuhisa Hara