-
Bug
-
Resolution: Done
-
Major
MK - Sprint 231
WHAT
Identify why the KafkaColocatingBrokers alert is firing so frequently.
WHY
Between Oct. 21 and Oct. 28, the `KafkaColocatingBrokers` alert fired 93 times, but each time it appears to have resolved itself. This is unexpected behaviour: the storage volumes for individual brokers are tied to an AZ, so brokers can't move between AZs, and we would not expect this alert to fire and then resolve itself.
HOW
Investigate when this alert has fired, identify the cause, and fix it if possible.
The alert is coded here: https://github.com/bf2fc6cc711aee1a0c2a/observability-resources-mk/blob/main/resources/prometheus/prometheus-rules.yaml#L160
There is a unit test here: https://github.com/bf2fc6cc711aee1a0c2a/observability-resources-mk/blob/main/resources/prometheus/unit_tests/KafkaColocatingBrokers.yaml
We suspect it fires during a restart, perhaps when a pod moves from one node to another and the query mistakenly sees this as two brokers being colocated.
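To make the suspicion concrete, here is a toy Python model of one way it could happen. It is only an illustration and does not reproduce the actual rule expression: it assumes a hypothetical query that counts broker series per zone, and it assumes that for a short window after a pod moves, both the old series (old node) and the new series (new node) for the same broker are visible. All pod, node and zone names are made up.

```python
from collections import defaultdict


def naive_colocated_zones(series):
    """Zones with more than one broker *series* (hypothetical naive query)."""
    counts = defaultdict(int)
    for s in series:
        counts[s["zone"]] += 1
    return sorted(zone for zone, n in counts.items() if n > 1)


def deduplicated_colocated_zones(series):
    """Zones with more than one distinct broker *pod*."""
    pods = defaultdict(set)
    for s in series:
        pods[s["zone"]].add(s["pod"])
    return sorted(zone for zone, names in pods.items() if len(names) > 1)


# Hypothetical snapshot taken just after kafka-0 moved from node-a to node-b.
# Both nodes are in the same zone, since a broker cannot change AZ.
during_restart = [
    {"pod": "kafka-0", "node": "node-a", "zone": "az-1"},  # old series, not yet expired
    {"pod": "kafka-0", "node": "node-b", "zone": "az-1"},  # new series
    {"pod": "kafka-1", "node": "node-c", "zone": "az-2"},
    {"pod": "kafka-2", "node": "node-d", "zone": "az-3"},
]

print(naive_colocated_zones(during_restart))         # counts series
print(deduplicated_colocated_zones(during_restart))  # counts distinct pods
```

If the real query behaves like the naive version, a single broker would briefly look like two brokers in one zone and the alert would clear once the old series disappears. Whether this matches the real rule needs to be checked against the expression linked above.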
To investigate:
- You can use the following prometheus query to see when it has fired: https://grafana.app-sre.devshift.net/explore?orgId=1&left=%7B%22datasource%22:%22mk-observatorium-production%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22editorMode%22:%22builder%22,%22expr%22:%22ALERTS%7Balertstate%3D%5C%22firing%5C%22,%20alertname%3D%5C%22KafkaColocatingBrokers%5C%22%7D%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true,%22format%22:%22table%22%7D%5D,%22range%22:%7B%22from%22:%22now-7d%22,%22to%22:%22now%22%7D%7D
- To see currently firing alerts, go to https://grafana.app-sre.devshift.net/d/viefn9LMz/mk-fleet-links?orgId=1&refresh=1m and log into an OpenShift cluster. Under Routes in the managed-application-services-observability project, click the route for Alertmanager.
- Try reproducing it on your own cluster: cordon the node that the Kafka broker is running on, then drain the Kafka broker pod from that node. The following SOP has instructions for cordoning and draining nodes: https://github.com/bf2fc6cc711aee1a0c2a/kas-sre-sops/blob/main/sops/kafka/external_access_in_az_failing.asciidoc#3-executeresolution
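Once the firing samples from the `ALERTS` query have been exported, it may help to collapse them into distinct firing episodes so that their start times and durations can be compared against broker restarts. The Python sketch below is one way to do that. The `firing_episodes` helper, the one-minute step and the sample timestamps are all assumptions for illustration, not real data from the alert.

```python
from datetime import datetime, timedelta


def firing_episodes(timestamps, step):
    """Group firing-sample timestamps into contiguous (start, end) episodes.

    Two samples belong to the same episode when they are at most `step` apart.
    """
    episodes = []
    for ts in sorted(timestamps):
        if episodes and ts - episodes[-1][1] <= step:
            episodes[-1][1] = ts  # extend the current episode
        else:
            episodes.append([ts, ts])  # start a new episode
    return [(start, end) for start, end in episodes]


# Hypothetical samples: one timestamp per evaluation where the alert was firing.
samples = [
    datetime(2022, 10, 21, 10, 0),
    datetime(2022, 10, 21, 10, 1),
    datetime(2022, 10, 21, 10, 2),
    datetime(2022, 10, 21, 11, 30),
    datetime(2022, 10, 21, 11, 31),
]

for start, end in firing_episodes(samples, step=timedelta(minutes=1)):
    print(start.isoformat(), "duration:", end - start)
```

Many short episodes that line up with pod restarts would support the restart hypothesis. Long-lived episodes would point to something else.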