Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: observability
Labels:
- kafka-integrations-europe-refinement-done

Blocked:
False
Blocked Reason:
None
Ready:
False
Discussed with Team:
No
Git Pull Request:
https://github.com/bf2fc6cc711aee1a0c2a/observability-resources-mk/pull/254
[QE] How to address?:
---
[QE] Why QE missed?:
---
Intelligence Requested:
Market:

Sprint:
MK - Sprint 231

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

WHAT

Identify why the KafkaColocatingBrokers alert is firing so frequently.

WHY

Between Oct. 21 and Oct. 28, the `KafkaColocatingBrokers` alert fired 93 times but appears to resolve on its own. This is unexpected behaviour: the storage volumes for individual brokers are tied to an AZ, so brokers cannot move between AZs, and we would not expect this alert to fire and then resolve itself.

HOW

Investigate when this alert has fired, identify the cause and fix if possible.
The alert is coded here: https://github.com/bf2fc6cc711aee1a0c2a/observability-resources-mk/blob/main/resources/prometheus/prometheus-rules.yaml#L160
There is a unit test here: https://github.com/bf2fc6cc711aee1a0c2a/observability-resources-mk/blob/main/resources/prometheus/unit_tests/KafkaColocatingBrokers.yaml

We suspect it fires during a restart, perhaps when a pod moves from one node to another and the query mistakenly sees this as two brokers being colocated.
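The suspected failure mode can be illustrated with a toy model. This is only a sketch of the hypothesis, not the actual alert expression: the series labels (`kafka`, `pod`, `node`, `zone`), the cluster and node names, and both counting functions are hypothetical. It assumes that, for a short period after a pod is rescheduled, the series for the old placement and the new placement are both visible to the query.

```python
from collections import defaultdict

# Each tuple is one time series visible at query time: (kafka, pod, node, zone).
# All names here are made up for illustration.
steady_state = [
    ("my-kafka", "my-kafka-kafka-0", "node-a1", "zone-a"),
    ("my-kafka", "my-kafka-kafka-1", "node-b1", "zone-b"),
    ("my-kafka", "my-kafka-kafka-2", "node-c1", "zone-c"),
]

# Assumed restart scenario: broker 0 moves from node-a1 to node-a2 (same zone)
# and the series for the old placement is still visible alongside the new one.
during_restart = steady_state + [
    ("my-kafka", "my-kafka-kafka-0", "node-a2", "zone-a"),
]


def colocated_by_series(series):
    """Flag (kafka, zone) pairs with more than one series (naive count)."""
    counts = defaultdict(int)
    for kafka, _pod, _node, zone in series:
        counts[(kafka, zone)] += 1
    return {key for key, n in counts.items() if n > 1}


def colocated_by_pod(series):
    """Flag (kafka, zone) pairs with more than one distinct broker pod."""
    pods = defaultdict(set)
    for kafka, pod, _node, zone in series:
        pods[(kafka, zone)].add(pod)
    return {key for key, names in pods.items() if len(names) > 1}


print(colocated_by_series(during_restart))  # naive count flags zone-a
print(colocated_by_pod(during_restart))     # distinct-pod count does not
```

If the real rule behaves like the naive count, the alert would fire while both series are visible and resolve once the old one disappears, which would match the observed pattern. Whether that is what happens still needs to be confirmed against the rule and the production data.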

To investigate:

You can use the following Prometheus query to see when it has fired: https://grafana.app-sre.devshift.net/explore?orgId=1&left=%7B%22datasource%22:%22mk-observatorium-production%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22editorMode%22:%22builder%22,%22expr%22:%22ALERTS%7Balertstate%3D%5C%22firing%5C%22,%20alertname%3D%5C%22KafkaColocatingBrokers%5C%22%7D%22,%22legendFormat%22:%22__auto%22,%22range%22:true,%22instant%22:true,%22format%22:%22table%22%7D%5D,%22range%22:%7B%22from%22:%22now-7d%22,%22to%22:%22now%22%7D%7D
To see currently firing alerts, go to https://grafana.app-sre.devshift.net/d/viefn9LMz/mk-fleet-links?orgId=1&refresh=1m and log into an OpenShift cluster. Under Routes in the managed-application-services-observability project, click the route for Alertmanager.
Try reproducing the problem on your own cluster: cordon the node that the Kafka broker is running on, then drain the Kafka broker pod from that node. The following SOP has instructions for cordoning and draining nodes: https://github.com/bf2fc6cc711aee1a0c2a/kas-sre-sops/blob/main/sops/kafka/external_access_in_az_failing.asciidoc#3-executeresolution
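When reviewing the history of the alert, it may help to turn the raw `ALERTS` samples into distinct firing episodes with a start, an end and a duration, so that short self-resolving firings stand out. The helper below is a minimal sketch: the function name is hypothetical, the sample data is made up, and it assumes the samples for one series have been exported as `[unix_timestamp, "value"]` pairs, the shape used in Prometheus range-query results.

```python
def firing_episodes(values, step_seconds):
    """Group range-query samples into (start, end) firing episodes.

    `values` is a list of [unix_timestamp, "value"] pairs for one series.
    Samples at most `step_seconds` apart are treated as the same episode.
    """
    episodes = []
    for ts in sorted(float(v[0]) for v in values):
        if episodes and ts - episodes[-1][1] <= step_seconds:
            episodes[-1][1] = ts  # extend the current episode
        else:
            episodes.append([ts, ts])  # gap found: start a new episode
    return [(start, end) for start, end in episodes]


# Made-up samples at a 60 s step: two separate firings.
samples = [[0, "1"], [60, "1"], [120, "1"], [600, "1"], [660, "1"]]
for start, end in firing_episodes(samples, step_seconds=60):
    print(f"fired from t={start:.0f}s to t={end:.0f}s ({end - start:.0f}s)")
```

Comparing the episode start times with broker pod restart times would be one way to test the restart hypothesis.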

mentioned on

Merge request - MGDSTRM-9181: upgrade kas-fleet-manager observability tag in production.

Assignee: Keith Wall

Reporter: Luke Chen

Team: Kafka Integrations

Votes: 0

Watchers: 5

Created: 2022/11/03 9:06 AM

Updated: 2023/01/25 10:11 AM

Resolved: 2023/01/25 10:11 AM
