Enhancement
Resolution: Done
Major
MK - Sprint 231
Overview
This is a proposal to add a new alert, or alter the current one, so that it fires when one of the zookeeper pods is down for X minutes. Currently, the ZookeeperContainersDown alert only fires when more than one zookeeper pod is down for 10 minutes. This makes sense, since zookeeper is designed to be HA. However, in a recent incident a single zookeeper pod was down for at least 40 minutes before it was restarted; during that time the strimzi-cluster-operator was affected, which caused the StrimziKafkaStuck alert to fire.
The proposed alert will:
1. Fire when a single zookeeper pod is down for X minutes.
2. Help us investigate why a zookeeper pod is down as soon as possible, and prevent compounded failures (e.g. the strimzi-cluster-operator case above).
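
As a starting point for discussion, the proposed condition could be sketched as a Prometheus alerting rule along the following lines. This is only an illustration under stated assumptions, not the existing ZookeeperContainersDown definition: the alert and group names are hypothetical, the rule assumes kube-state-metrics is deployed (it relies on its kube_pod_status_ready metric), the pod name pattern assumes Strimzi's usual <cluster>-zookeeper-<n> naming, and the 15m duration and warning severity are placeholders because the proposal leaves X open.

```yaml
groups:
  - name: zookeeper-single-pod          # hypothetical group name
    rules:
      - alert: ZookeeperPodDown         # hypothetical alert name
        # One series per zookeeper pod; the value is 0 while the pod is not Ready.
        # The pod regex is an assumption about the naming convention in use.
        expr: kube_pod_status_ready{condition="true", pod=~".+-zookeeper-[0-9]+"} == 0
        # Placeholder for X; the proposal does not fix a value.
        for: 15m
        labels:
          severity: warning             # assumed severity
        annotations:
          summary: "Zookeeper pod {{ $labels.pod }} in {{ $labels.namespace }} has not been Ready for 15 minutes"
```

Because the expression evaluates per pod, the rule fires for each individual pod that stays not Ready, rather than waiting for more than one pod to be down. One caveat of this sketch: if the pod object is deleted entirely, its series disappears and the rule will not fire, so a real implementation might need to compare against the expected replica count instead.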