Bug
Resolution: Unresolved
Undefined
4.20.z
False
Moderate
Description of problem:
The alert `AlertmanagerClusterFailedToSendAlerts` does not fire even though alerting is misconfigured so that all Alertmanager instances fail to send notifications.
Version-Release number of selected component (if applicable):
OCP 4.20.4
How reproducible:
Install an OCP 4.20.4 cluster and deliberately misconfigure the alerting configuration (in this case, a wrong mail server port) so that sending alerts fails.
Steps to Reproduce:
1. Install OCP 4.20.4.
2. Misconfigure alerting, e.g. configure a mail receiver with the wrong mail server port so that all Alertmanager instances fail to send alerts.
3. Check for alerts.
Actual results:
Only the following alerts are shown:
- Watchdog
- AlertmanagerFailedToSendAlerts
Expected results:
In addition to the alerts listed under actual results, the following alert is shown:
- AlertmanagerClusterFailedToSendAlerts
Additional info:
The expression for the alert seems to be wrong. In particular, the closing bracket near the end should be moved to after `> 0.01`:
$ oc get PrometheusRule -o yaml alertmanager-main-rules |grep -A 10 AlertmanagerClusterFailedToSendAlerts
- alert: AlertmanagerClusterFailedToSendAlerts
  annotations:
    description: The minimum notification failure rate to {{ $labels.integration
      }} sent from any instance in the {{$labels.job}} cluster is {{ $value |
      humanizePercentage }}.
    runbook_url: https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/AlertmanagerClusterFailedToSendAlerts.md
    summary: All Alertmanager instances in a cluster failed to send notifications
      to a critical integration.
  expr: |
    min by (namespace,service, integration) (
      rate(alertmanager_notifications_failed_total{job=~"alertmanager-main|alertmanager-user-workload", integration=~`.*`}[15m])
    /
      ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload", integration=~`.*`}[15m])
    ) <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    > 0.01 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
  for: 5m
Changing it to
[...]
      ignoring (reason) group_left rate(alertmanager_notifications_total{job=~"alertmanager-main|alertmanager-user-workload", integration=~`.*`}[15m])
      > 0.01 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
    ) <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
seems to work.
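The difference between the two bracket positions can be illustrated with a small Python sketch. It only models the PromQL semantics involved (a comparison without `bool` drops the series that do not match, and `min by` then aggregates whatever is left). The function names and the sample ratios are hypothetical. In particular, the assumption that one series in the group has a failure ratio of 0 is not confirmed by the report and is used here only to show how the two forms can diverge.

```python
from typing import List, Optional

THRESHOLD = 0.01  # same threshold as in the alert expression


def min_then_compare(ratios: List[float], threshold: float = THRESHOLD) -> Optional[float]:
    """Models the current rule: min by (...) (ratio) > threshold.

    The minimum is taken over all series first, then compared.
    Returns None when the alert expression would yield no result.
    """
    if not ratios:
        return None
    lowest = min(ratios)
    return lowest if lowest > threshold else None


def compare_then_min(ratios: List[float], threshold: float = THRESHOLD) -> Optional[float]:
    """Models the proposed rule: min by (...) (ratio > threshold).

    Series at or below the threshold are dropped before the minimum is taken.
    """
    kept = [r for r in ratios if r > threshold]
    return min(kept) if kept else None


# Hypothetical failure ratios for one (namespace, service, integration) group.
# Assumption: two series fail completely and one series reports a ratio of 0.
sample = [1.0, 1.0, 0.0]

print(min_then_compare(sample))  # None: the single 0.0 pulls the minimum below the threshold
print(compare_then_min(sample))  # 1.0: the 0.0 series is filtered out before aggregation
```

One consequence worth checking before adopting the change: in this model the proposed form also returns a result when only a single series exceeds the threshold, so it may no longer express "all instances failed".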