Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-67162

AlertmanagerClusterFailedToSendAlerts never triggers

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Undefined Undefined
    • None
    • 4.20.z
    • Monitoring
    • None
    • None
    • False
    • Hide

      None

      Show
      None
    • None
    • Moderate
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      Description of problem:

      the alert `AlertmanagerClusterFailedToSendAlerts` is not triggered even though alerting is misconfigured 

      Version-Release number of selected component (if applicable):

      OCP 4.20.4

      How reproducible:

      Install an OCP 4.20.4 cluster, deliberately mis-configure alertconfig (in this case wrong mail server port) to ensure alerting is failing

      Steps to Reproduce:

          1. Install OCP 4.10.4
          2. mis-configure alerting, e.g. configure mail receiver with wrong mail server port to ensure all alertmanager instances fail to send alerts
          3. check for alerts
          

      Actual results:

      just showing alerts:
      - Watchdog
      - AlertmanagerFailedToSendAlerts

      Expected results:

      showing in addition to the actual results
      - AlertmanagerClusterFailedToSendAlerts

      Additional info:

       it seems the expression for the alert is wrint, in particular the bracket close to the end should be moved to after 0.01:
      
      
      $ oc get PrometheusRule -o yaml alertmanager-main-rules  |grep -A 10 AlertmanagerClusterFailedToSendAlerts
          - alert: AlertmanagerClusterFailedToSendAlerts
            annotations:
              description: The minimum notification failure rate to {{ $labels.integration
                }} sent from any instance in the {{$labels.job}} cluster is {{ $value |
                humanizePercentage }}.
              runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerClusterFailedToSendAlerts.md
              summary: All Alertmanager instances in a cluster failed to send notifications
                to a critical integration.
            expr: |
              min by (namespace,service, integration) (
                rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload", integration=~`.*`}[15m])
              /
                ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload", integration=~`.*`}[15m])
              )                 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
              > 0.01            <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
            for: 5m
      
      
      changing it to
      
       [...]
                ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload", integration=~`.*`}[15m])
              > 0.01            <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
              ) <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
      
      seems to work 

              juzhao@redhat.com Junqi Zhao
              rhn-support-dmoessner Daniel Moessner
              None
              None
              Junqi Zhao Junqi Zhao
              None
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Created:
                Updated: