-
Task
-
Resolution: Done
-
Major
-
2.0.1.GA
-
False
-
None
-
False
-
MK - Sprint 235
A restarted Kafka broker may have to perform recovery actions (e.g. loading producer snapshots, rebuilding indexes) that take a long time, sometimes hours. The Strimzi operator may want to roll the broker for some other reason while this recovery is still in progress.
The current KafkaRoller implementation doesn't take into account that the broker is recovering, because it has no way to get that piece of information. The broker is "not listening" during the recovery phase, so the KafkaRoller gets no response to its admin calls and, after a timeout, forces the broker to roll. The restarted broker then begins recovery again on startup, so this cycle can continue forever.
To address this issue, the KafkaRoller should have a way to know that a broker is recovering, so that it can make the right decision: keep waiting for the recovery to end instead of forcing the roll after a timeout. The idea is to make the kafka.server:type=KafkaServer,name=BrokerState metric available to the KafkaRoller to provide this information.
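The proposed decision could look roughly like the following minimal sketch. It is not the actual KafkaRoller code: the class name, the method shouldForceRoll, and the way the metric value reaches the roller are all hypothetical. The only thing taken from Kafka itself is that the BrokerState gauge reports 2 while the broker is in recovery and 3 once it is running.

```java
import java.util.OptionalInt;

/**
 * Hypothetical sketch of a recovery-aware roll decision.
 * Not the real Strimzi KafkaRoller; names and structure are assumptions.
 */
final class RecoveryAwareRoll {

    /** BrokerState gauge value for a broker in recovery (per Apache Kafka's BrokerState enum). */
    static final int BROKER_STATE_RECOVERY = 2;

    /** BrokerState gauge value for a fully started broker. */
    static final int BROKER_STATE_RUNNING = 3;

    private RecoveryAwareRoll() {
    }

    /**
     * Decides whether the roller should force a restart of the broker.
     *
     * @param brokerState   last known value of the BrokerState metric, or empty
     *                      if it could not be obtained (assumption: some channel,
     *                      such as the kafka-agent, exposes it to the roller)
     * @param adminTimedOut true if the admin calls to the broker timed out
     * @return true if the broker should be force-rolled, false if the roller
     *         should keep waiting or has no reason to force a roll
     */
    static boolean shouldForceRoll(OptionalInt brokerState, boolean adminTimedOut) {
        // A recovering broker is expected to be unresponsive: never force a roll,
        // because a restart would only make the recovery start over.
        if (brokerState.isPresent() && brokerState.getAsInt() == BROKER_STATE_RECOVERY) {
            return false;
        }
        // Otherwise keep today's behaviour: force the roll once the admin calls time out.
        return adminTimedOut;
    }
}
```

With this sketch, shouldForceRoll(OptionalInt.of(2), true) returns false, so a recovering broker is left alone even after the timeout. If the state is unknown (OptionalInt.empty()), the sketch falls back to the current timeout-based behaviour.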
Today, this metric is already used by the kafka-agent [2] in order to create the kafka-ready file on the broker pod which is used in the liveness probe. ENTMQST-4418 will expose this information through the kafka-agent itself or Kafka's metrics reporter.
- depends on
-
ENTMQST-4418 PoC of KafkaAgent in Strimzi for exposing metrics
- Closed