  1. AMQ Streams
  2. ENTMQST-3931

KafkaRoller: Avoid restarting brokers in LogRecovery


    • MK - Sprint 235

      A Kafka broker that has just been restarted may need to run recovery actions (e.g. loading producer snapshots, rebuilding indexes) that take a long time, in some cases hours. During that window, the Strimzi operator may want to roll the same broker for an unrelated reason.

      The current KafkaRoller implementation does not take into account that the broker is recovering, because it has no way to obtain that information. The broker is "not listening" during the recovery phase, so the KafkaRoller gets no response to its admin calls and, after a timeout, forces the broker to roll. The forced restart triggers recovery again on startup, so this cycle can repeat indefinitely.

      To address this issue, the KafkaRoller needs a way to know that a broker is recovering so that it can make the right decision: keep waiting for the recovery to end instead of forcing a roll when the timeout expires. The idea is to make the kafka.server:type=KafkaServer,name=BrokerState metric available to the KafkaRoller for this purpose.
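
      The intended decision can be sketched roughly as follows. This is only an illustration under stated assumptions, not the KafkaRoller implementation: the class, enum, and method names are hypothetical, -1 is an arbitrary placeholder for "state could not be read", and the sketch assumes the broker reports BrokerState value 2 (RECOVERY in Kafka's BrokerState enum) while it is recovering its logs. Whether the wait should have an upper bound is left open here, as it is in the description above.

```java
// Hypothetical sketch: none of these names come from Strimzi's KafkaRoller.
final class RecoveryAwareRollDecision {

    enum Action { WAIT, FORCE_RESTART }

    // Value of kafka.server:type=KafkaServer,name=BrokerState while the broker
    // is recovering its logs (RECOVERY in Kafka's BrokerState enum).
    static final int RECOVERY = 2;

    private RecoveryAwareRollDecision() { }

    /**
     * Decides what to do with a broker that is not answering admin calls.
     *
     * @param brokerState latest BrokerState value, or -1 (assumed placeholder)
     *                    if the state could not be obtained
     * @param waitedMs    how long the roller has already waited
     * @param timeoutMs   the normal timeout after which a roll would be forced
     */
    static Action decide(int brokerState, long waitedMs, long timeoutMs) {
        if (brokerState == RECOVERY) {
            // Restarting now would only make recovery start over, so keep
            // waiting even if the normal timeout has already expired.
            return Action.WAIT;
        }
        // Not known to be recovering: keep the existing timeout behaviour.
        return waitedMs >= timeoutMs ? Action.FORCE_RESTART : Action.WAIT;
    }
}
```

      With this shape, a broker in recovery is never force-rolled because of the timeout alone, while a broker whose state is unknown or not RECOVERY is still handled as it is today.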

      Today, this metric is already used by the kafka-agent [2] in order to create the kafka-ready file on the broker pod which is used in the liveness probe. ENTMQST-4418 will expose this information through the kafka-agent itself or Kafka's metrics reporter.
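
      For illustration, one way code running inside the broker JVM could read this metric is through the standard JMX API. This is a sketch under assumptions, not how the kafka-agent is actually written: it assumes the metric is registered in the platform MBean server as a gauge with a "Value" attribute, and the class and method names are hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.util.OptionalInt;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical helper: not part of the kafka-agent or of Strimzi.
final class BrokerStateProbe {

    static final String BROKER_STATE_MBEAN =
            "kafka.server:type=KafkaServer,name=BrokerState";

    private BrokerStateProbe() { }

    /** Reads the BrokerState gauge; empty if it is missing or not numeric. */
    static OptionalInt read(MBeanServer server) {
        try {
            // Assumption: the gauge exposes its reading as attribute "Value".
            Object value = server.getAttribute(
                    new ObjectName(BROKER_STATE_MBEAN), "Value");
            return value instanceof Number
                    ? OptionalInt.of(((Number) value).intValue())
                    : OptionalInt.empty();
        } catch (JMException e) {
            // MBean not registered (e.g. not running inside a broker JVM),
            // attribute missing, or malformed object name.
            return OptionalInt.empty();
        }
    }

    /** Convenience wrapper for the MBean server of the current JVM. */
    static OptionalInt readLocal() {
        return read(ManagementFactory.getPlatformMBeanServer());
    }
}
```

      Outside a broker JVM the MBean does not exist, so readLocal() returns an empty result rather than throwing. A caller would then treat the state as unknown.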

              rh-ee-gselenge Gantigmaa Selenge
              ppatiern Paolo Patierno
              Jan Kalinic Jan Kalinic