-
Enhancement
-
Resolution: Done
-
Major
What
We want an alert that fires when the disk usage on one broker is much higher than on the others. This will be the trigger for the SREs to run Cruise Control so that the imbalance is remediated.
The alert will be written so that it also detects the condition in the 3-broker case, as there are some corner cases there (such as modifications to RF) that could lead to it. We accept the fact that the Cruise Control MVP won't remediate these problems, but it is still preferable to know the condition exists.
How
(KW) I was wondering if we could take a statistical approach, perhaps using ideas from https://prometheus.io/blog/2015/06/18/practical-anomaly-detection/. For example: "disk usage on broker > x% and disk usage > n standard deviations above the mean disk usage for the whole cluster".
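To make the proposed condition concrete, here is a minimal Python sketch of the check. It is not the rule that was added to prometheus-rules.yaml. The function name `skewed_brokers`, its parameters, and the default thresholds are hypothetical, and it assumes per-broker disk usage is already available as a percentage.

```python
from statistics import mean, pstdev


def skewed_brokers(usage_pct, min_usage_pct=75.0, n_stddev=1.0):
    """Return the sorted IDs of brokers whose disk usage is both above
    min_usage_pct and more than n_stddev population standard deviations
    above the cluster mean.

    usage_pct maps broker ID -> disk usage in percent (hypothetical input).
    """
    values = list(usage_pct.values())
    if len(values) < 2:
        # A single broker has nothing to be compared against.
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Perfectly balanced cluster: no skew to report.
        return []
    return sorted(
        broker
        for broker, used in usage_pct.items()
        if used > min_usage_pct and (used - mu) / sigma > n_stddev
    )
```

For example, `skewed_brokers({0: 40.0, 1: 42.0, 2: 90.0})` returns `[2]`. One point to keep in mind for the 3-broker case: with the population standard deviation, a single value among N can be at most sqrt(N - 1) standard deviations above the mean, which is about 1.41 for N = 3. A threshold of n >= 1.5 could therefore never fire on a 3-broker cluster, so n would need to be chosen below that bound.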
Done
- New alerting rule added to https://github.com/bf2fc6cc711aee1a0c2a/observability-resources-mk/blob/main/resources/prometheus/prometheus-rules.yaml
- blocks MGDSTRM-8041 Create SOP to remediate severe disk skew (Closed)