Uploaded image for project: 'AMQ Streams'
  1. AMQ Streams
  2. ENTMQST-1556

Exponential backoff in KafkaRoller sometimes takes 20+ minutes

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Major Major
    • 1.5.0.GA
    • 1.4.0.GA
    • None
    • None
    • 2019.15, 2020.1, 2020.2

      When rolling Kafka pods, we use exponential backoff to try to roll the cluster cleanly before force rolling it (for example when the pods don't work). The exponential backoff should be reasonably fast, doing 10 loops between 250ms - 64000ms of waiting between them.

      When the pods are not created at all (e.g. because they cannot be scheduled), it seems to work as desired. However, when they are already created, the exponential backoff takes more than 20 minutes because of waiting for the AdminClient to be closed.

      One of the ways how to reproduce it is to enable Kafka rack-awareness using a label which does not exist. That will make the pod to exist, but the init-container fail and crashloop. Then edit the resource and fix the label or disable rack awareness. The CO is supposed to see the change and apply it.

      However, instead in this situation, the AdminClient for whatever reason seems to connect somewhere (don't ask where, maybe Kubernetes somehow keep the ports open or maybe it is a bug in the AdminClient which doesn't handle properly when the address exists but nobody listens on the port). It gets up to the point where we try to get the controller node:

      DescribeClusterResult describeClusterResult = ac.describeCluster();
      KafkaFuture<Node> controller = describeClusterResult.controller();
      controllerNode = controller.get(timeout, unit);
      

      The controller.get is using timeout of 1 minute currently. So it wait for 1 minute and than it throws TimeoutException:

      java.util.concurrent.TimeoutException: null
      --
        | at org.apache.kafka.common.internals.KafkaFutureImpl$SingleWaiter.await(KafkaFutureImpl.java:108) ~[org.apache.kafka.kafka-clients-2.3.1.jar:?]
        | at org.apache.kafka.common.internals.KafkaFutureImpl.get(KafkaFutureImpl.java:272) ~[org.apache.kafka.kafka-clients-2.3.1.jar:?]
        | at io.strimzi.operator.cluster.operator.resource.KafkaRoller.controller(KafkaRoller.java:427) ~[io.strimzi.cluster-operator-0.16.0-SNAPSHOT.jar:0.16.0-SNAPSHOT]
        | at io.strimzi.operator.cluster.operator.resource.KafkaRoller.restartIfNecessary(KafkaRoller.java:251) ~[io.strimzi.cluster-operator-0.16.0-SNAPSHOT.jar:0.16.0-SNAPSHOT]
        | at io.strimzi.operator.cluster.operator.resource.KafkaRoller.lambda$schedule$5(KafkaRoller.java:194) ~[io.strimzi.cluster-operator-0.16.0-SNAPSHOT.jar:0.16.0-SNAPSHOT]
        | at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_222]
        | at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_222]
        | at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_222]
        | at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_222]
        | at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_222]
        | at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_222]
        | at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]
      

      The exception is not handled in given try block, and instead of dealing with the exception, we try to close the client (by calling the method closeLoggingAnyError). This method sees that the AdminClient is not `null` and try to close it with timeout of another 2 minutes:

      adminClient.close(Duration.ofMinutes(2));
      

      As a result, the exponential backoff is really taking just milliseconds. But we are taking a lot of time between the backoffs waiting for stuff. And during the whole time, we do not print a single message to the log on a log level of INFO or above.

      tbentley-1 Is this desired? Do we really need to wait for the 1 minute for the controller and 2 minutes to try to close a client which is not connected anywhere? Any ideas how to best improve this?

      This JIRA was created based on Strimzi#2263

              Unassigned Unassigned
              scholzj JAkub Scholz
              Lukas Kral Lukas Kral
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Created:
                Updated:
                Resolved: