  1. AMQ Streams
  2. ENTMQST-3839

Broker stuck in an inconsistent state after ZooKeeper disconnection


    • Bug
    • Resolution: Done
    • Critical
    • 2.1.0.GA
    • 1.8.4.GA
    • kafka-broker, zookeeper
    • None
    • False
    • False
    • Workaround: Restart the controller

      A connection issue with the ZK nodes [1] appears to have caused Brokers 5 and 1 to attempt reconnection and then to resign as controller after the first auth failure [2]. That failure is expected, because no auth has been configured. The concurrent resignation left Broker 1 in an inconsistent state. Even when these conditions are met, the issue occurs only occasionally; most of the time the broker recovers by itself. This looks like KAFKA-13461 [3], which was resolved a few days ago. The workaround, restarting the controller, is also consistent with that recently resolved Kafka bug.

      [1]

      broker1-server.log:2022-02-17 05:08:49,801 - WARN  [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
      broker1-server.log:2022-02-17 05:08:49,803 - WARN  [main-SendThread(zk-1-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
      broker1-server.log:2022-02-17 05:08:54,802 - INFO  [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
      broker1-server.log:2022-02-17 05:08:54,802 - INFO  [main-SendThread(zk-1-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
      broker5-server.log:2022-02-17 05:36:29,861 - WARN  [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
      broker5-server.log:2022-02-17 05:36:29,862 - INFO  [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
      broker5-server.log:2022-02-17 05:38:33,909 - WARN  [main-SendThread(zk-1-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
      broker5-server.log:2022-02-17 05:38:33,909 - WARN  [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
      broker5-server.log:2022-02-17 05:38:33,909 - INFO  [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
      broker5-server.log:2022-02-17 05:38:33,909 - INFO  [main-SendThread(zk-1-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
      broker1-server.log:2022-02-17 05:38:49,062 - WARN  [zk-client-Kafkaserver-reinit-0-SendThread(zk-2-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
      broker1-server.log:2022-02-17 05:38:49,062 - INFO  [zk-client-Kafkaserver-reinit-0-SendThread(zk-2-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
      

      [2]

      2022-02-17 05:38:35,760 - INFO  [zk-client-Kafkaserver-reinit-0:Logging@66] - [ZooKeeperClient Kafka server] Reinitializing due to auth failure.
      2022-02-17 05:38:35,762 - DEBUG [controller-event-thread:Logging@62] - [Controller id=5] Resigning
      2022-02-17 05:38:50,355 - INFO  [zk-client-Kafkaserver-reinit-0:Logging@66] - [ZooKeeperClient Kafka server] Reinitializing due to auth failure.
      2022-02-17 05:38:50,531 - INFO  [controller-event-thread:Logging@66] - [Controller id=1] Resigned
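
To check other broker logs for the same signature as [2], a small scanner along these lines may help. It is only a sketch: the function name `find_suspect_resignations` and the 30-second window are assumptions made for illustration, and the regular expression is fitted to the log layout shown in the excerpts above, so it may need adjusting for other log4j patterns.

```python
import re
from datetime import datetime, timedelta

# Matches lines shaped like the excerpts above, with an optional
# "brokerN-server.log:" prefix as produced by grep over several files.
LINE_RE = re.compile(
    r"^(?:\S+\.log:)?"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<level>\w+)\s+\[(?P<thread>.*?)\] - (?P<msg>.*)$"
)
REINIT = "Reinitializing due to auth failure"
RESIGN_RE = re.compile(r"\[Controller id=(?P<id>\d+)\] Resign(?:ing|ed)")


def find_suspect_resignations(lines, window_seconds=30):
    """Return (controller_id, resign_time, reinit_time) tuples for every
    controller resignation logged within `window_seconds` after a
    'Reinitializing due to auth failure' message.

    The window is an arbitrary illustrative default, not a Kafka setting.
    """
    window = timedelta(seconds=window_seconds)
    last_reinit = None
    hits = []
    for raw in lines:
        m = LINE_RE.match(raw.strip())
        if not m:
            continue  # skip lines that do not follow the expected layout
        ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
        msg = m["msg"]
        if REINIT in msg:
            last_reinit = ts
            continue
        resign = RESIGN_RE.search(msg)
        if resign and last_reinit is not None:
            if timedelta(0) <= ts - last_reinit <= window:
                hits.append((int(resign["id"]), ts, last_reinit))
    return hits
```

Fed the four lines of [2] in order, this flags the resignations of controllers 5 and 1. A hit only means the pattern is present; whether the broker actually stayed stuck still has to be confirmed separately.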
      

      [3]
      https://issues.apache.org/jira/browse/KAFKA-13461

      Basically, when there is no JAAS configured for ZK client and the ZK client tries to establish a new connection, the client will first receive an AUTH_FAIL event. However, this doesn't mean that the ZK client's session is gone since the client will retry the connection without auth, which typically succeeds. Previously, we mistakenly try to reinitialize the controller with the AUTH_FAIL event, which causes the controller to resign but not regain the controllership since the underlying session is still valid.
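
The failure mode described above can be illustrated with a toy model. This is not Kafka's code: `ToyCluster`, `ToyBroker`, and all method names are hypothetical. The model also assumes that an election is skipped whenever a controller znode already exists, which is a simplification made for this sketch. The only ZooKeeper behaviour relied on is that an ephemeral node lives as long as the session that created it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyCluster:
    """Hypothetical stand-in for the ephemeral controller znode."""
    controller_znode_owner: Optional[int] = None


@dataclass
class ToyBroker:
    broker_id: int
    cluster: ToyCluster
    acting_as_controller: bool = False

    def elect(self):
        # Assumption of this sketch: no election if the znode already exists.
        if self.cluster.controller_znode_owner is None:
            self.cluster.controller_znode_owner = self.broker_id
            self.acting_as_controller = True

    def resign(self):
        # Stops controller duties; does not touch the znode.
        self.acting_as_controller = False

    def on_session_expired(self):
        # The ephemeral znode disappears together with the session.
        if self.cluster.controller_znode_owner == self.broker_id:
            self.cluster.controller_znode_owner = None
        self.resign()
        self.elect()  # a new session can win the election again

    def on_auth_failed(self, reinitialize_on_auth_fail: bool):
        # The session is still valid here, so the znode stays in place.
        if reinitialize_on_auth_fail:
            self.resign()
            self.elect()  # skipped: the znode still names this broker
        # Otherwise the event is ignored and the broker keeps its role.


def run_scenario(reinitialize_on_auth_fail: bool) -> ToyBroker:
    broker = ToyBroker(broker_id=1, cluster=ToyCluster())
    broker.elect()
    broker.on_auth_failed(reinitialize_on_auth_fail)
    return broker
```

With `reinitialize_on_auth_fail=True`, the model ends with the znode still naming broker 1 while the broker no longer acts as controller, which mirrors the inconsistent state reported here. Calling `on_session_expired()` afterwards clears the stale state in this model, which is loosely analogous to the restart workaround.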

              Unassigned
              Antonio Gagliardi (rhn-support-agagliar)
              Lukas Kral