Type: Enhancement
Resolution: Done
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- kafka-integrations-apac-refinement-done
- kafka-integrations-europe-refinement-done

Epic Link:
Cruise Control MVP
Blocked:
False
Blocked Reason:
None
Ready:
False
Discussed with Team:
No
Git Pull Request:
https://github.com/bf2fc6cc711aee1a0c2a/observability-resources-mk/pull/196

Sprint:
MK - Sprint 218

What

We want an alert that fires if disk usage on one broker is much higher than on the others. This will be the trigger for the SREs to run Cruise Control so that the imbalance is remediated.

The alert will be written so that it also detects the condition in the 3-broker case, as there are some corner cases there that could lead to it (modifications to the replication factor, RF). We accept that the Cruise Control MVP won't remediate these problems, but it is still preferable to know the condition exists.

How

(KW) I was wondering whether we could take a statistical approach, perhaps using ideas from https://prometheus.io/blog/2015/06/18/practical-anomaly-detection/. For example: "disk usage on a broker > x% and disk usage > n standard deviations above the mean disk usage for the whole cluster".
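
To make the proposed condition concrete, here is a minimal sketch of the check in Python. It is not the rule that was merged; the function name, the broker labels, and the default thresholds (`min_usage`, `n_sigma`) are assumptions chosen for illustration, and disk usage is assumed to be expressed as a fraction between 0 and 1.

```python
from statistics import mean, pstdev


def skewed_brokers(usage, min_usage=0.5, n_sigma=1.0):
    """Return the brokers whose disk usage looks skewed.

    usage     -- mapping of broker name to disk usage fraction (0.0 to 1.0)
    min_usage -- hypothetical "x%" floor, so nearly empty clusters never alert
    n_sigma   -- hypothetical "n": standard deviations above the cluster mean
    """
    values = list(usage.values())
    cluster_mean = mean(values)
    # Population standard deviation across all brokers in the cluster.
    cluster_stddev = pstdev(values)
    threshold = cluster_mean + n_sigma * cluster_stddev
    # A broker is flagged only if BOTH conditions hold.
    return sorted(
        broker
        for broker, used in usage.items()
        if used > min_usage and used > threshold
    )


# Example: one broker holds far more data than the other two.
cluster = {"broker-0": 0.30, "broker-1": 0.32, "broker-2": 0.80}
print(skewed_brokers(cluster))  # ['broker-2']
```

One thing to keep in mind for the 3-broker case: with a population standard deviation (which is what the Prometheus `stddev` aggregation computes), a single value out of three can never be more than sqrt(2), roughly 1.41, standard deviations above the mean. In this sketch, `skewed_brokers(cluster, n_sigma=1.5)` returns an empty list for the same data, so n would need to stay below that bound for the alert to be able to fire on a 3-broker cluster.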

Done

New alerting rule added to https://github.com/bf2fc6cc711aee1a0c2a/observability-resources-mk/blob/main/resources/prometheus/prometheus-rules.yaml

Issue Links:
blocks MGDSTRM-8041 Create SOP to remediate severe disk skew (Closed)

Assignee: Kate Stanley (Inactive)

Reporter: Keith Wall

Team: Kafka Integrations

Votes: 0

Watchers: 3

Created: 2022/04/01 9:08 AM

Updated: 2022/09/09 6:24 AM

Resolved: 2022/05/09 9:01 AM
