Task
Resolution: Done
According to the alerting guidelines (https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md), alerts should include a namespace label. While we can't enforce this rule statically, we can use telemetry data to spot, after the fact, which alerts don't comply with the guidelines and file bugs against the non-compliant operators.
To find out which alerting rules don't follow the guidelines, the steps should look like this:
1. Spin up a cluster from the latest stable version
2. Query the /api/v1/rules endpoint from the Thanos querier service and extract all the product alert names.
curl https://thanos-querier.../api/v1/rules | jq -cr '.data.groups | map(.rules) | flatten | map(select(.type == "alerting")) | map(.name) | unique | join("|")'
3. From https://telemeter-lts.datahub.redhat.com, extract the list of all product alerts that fired without a namespace label, grouped by minor release.
count by (alertname,version) (alerts{alertname=~"<insert list>",namespace=""} * on(_id) group_left(version) max by(_id, version) (label_replace(id_version_ebs_account_internal:cluster_subscribed{version=~"4.1(2|3|4).*"}, "version", "$1", "version", "^(4.\\d+).*$")))
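For whoever ends up documenting the procedure, steps 2 and 3 could be scripted roughly as below. This is only a sketch: it assumes the JSON returned by /api/v1/rules has already been fetched and parsed, and the function names `alert_names` and `build_query` as well as the sample alert names are made up for illustration. The query text itself is copied from step 3.

```python
def alert_names(rules_payload):
    """Return the sorted, de-duplicated names of all alerting rules.

    Mirrors the jq filter from step 2: recording rules are skipped and
    duplicate names are collapsed.
    """
    names = {
        rule["name"]
        for group in rules_payload["data"]["groups"]
        for rule in group["rules"]
        if rule.get("type") == "alerting"
    }
    return sorted(names)


def build_query(names, version_regex="4.1(2|3|4).*"):
    """Fill the alert list into the telemetry query from step 3.

    version_regex defaults to the value used in the ticket; adjust it
    for the minor releases of interest.
    """
    return (
        'count by (alertname,version) ('
        'alerts{alertname=~"' + "|".join(names) + '",namespace=""} '
        '* on(_id) group_left(version) max by(_id, version) ('
        'label_replace(id_version_ebs_account_internal:cluster_subscribed'
        '{version=~"' + version_regex + '"}, '
        r'"version", "$1", "version", "^(4.\\d+).*$")))'
    )


if __name__ == "__main__":
    # Hypothetical payload in the shape returned by /api/v1/rules.
    sample = {
        "data": {
            "groups": [
                {"rules": [
                    {"type": "alerting", "name": "ExampleAlertB"},
                    {"type": "recording", "name": "example:recording:rule"},
                ]},
                {"rules": [
                    {"type": "alerting", "name": "ExampleAlertA"},
                    {"type": "alerting", "name": "ExampleAlertB"},
                ]},
            ]
        }
    }
    print(build_query(alert_names(sample)))
```

The printed query can then be pasted into telemeter-lts as in step 3. Joining the names with "|" assumes alert names contain no regex metacharacters, which is the same assumption the jq one-liner makes.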
DoD:
- The procedure above is documented in the CMO repository or in rhobs/handbook.
- OCPBUGS tickets are opened against each component that needs to fix its alerts.
- depends on
  - OCPBUGS-10699 Modification of alerts for `Kube*QuotaOvercommit` (Closed)
  - OCPBUGS-17191 Missing namespace label for several CMO alerts (Closed)