  1. AMQ Streams
  2. ENTMQST-3839

The broker is stuck in an inconsistent state after ZooKeeper disconnection

Details

    • Bug
    • Resolution: Done
    • Critical
    • 2.1.0.GA
    • 1.8.4.GA
    • kafka-broker, zookeeper
    • None
    • False
    • False
    • Workaround: Restart the controller

    Description

      A connection issue with the ZK nodes [1] appears to have caused Brokers 5 and 1 to reconnect and, in addition, to resign the controllership after the first auth failure [2]. The auth failure itself is expected, because no authentication has been configured. The concurrent resignation left Broker 1 in an inconsistent state. Even when these conditions are met, the issue occurs only occasionally; most of the time the broker recovers by itself. This appears to be KAFKA-13461 [3], which was fixed a few days ago. The workaround, restarting the controller, is also consistent with that recently fixed Kafka bug.

      [1]

      broker1-server.log:2022-02-17 05:08:49,801 - WARN  [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
      broker1-server.log:2022-02-17 05:08:49,803 - WARN  [main-SendThread(zk-1-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
      broker1-server.log:2022-02-17 05:08:54,802 - INFO  [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
      broker1-server.log:2022-02-17 05:08:54,802 - INFO  [main-SendThread(zk-1-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
      broker5-server.log:2022-02-17 05:36:29,861 - WARN  [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
      broker5-server.log:2022-02-17 05:36:29,862 - INFO  [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
      broker5-server.log:2022-02-17 05:38:33,909 - WARN  [main-SendThread(zk-1-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
      broker5-server.log:2022-02-17 05:38:33,909 - WARN  [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
      broker5-server.log:2022-02-17 05:38:33,909 - INFO  [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
      broker5-server.log:2022-02-17 05:38:33,909 - INFO  [main-SendThread(zk-1-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
      broker1-server.log:2022-02-17 05:38:49,062 - WARN  [zk-client-Kafkaserver-reinit-0-SendThread(zk-2-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
      broker1-server.log:2022-02-17 05:38:49,062 - INFO  [zk-client-Kafkaserver-reinit-0-SendThread(zk-2-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
      

      [2]

      2022-02-17 05:38:35,760 - INFO  [zk-client-Kafkaserver-reinit-0:Logging@66] - [ZooKeeperClient Kafka server] Reinitializing due to auth failure.
      2022-02-17 05:38:35,762 - DEBUG [controller-event-thread:Logging@62] - [Controller id=5] Resigning
      2022-02-17 05:38:50,355 - INFO  [zk-client-Kafkaserver-reinit-0:Logging@66] - [ZooKeeperClient Kafka server] Reinitializing due to auth failure.
      2022-02-17 05:38:50,531 - INFO  [controller-event-thread:Logging@66] - [Controller id=1] Resigned
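
The pattern in [2] (a controller resignation logged right after "Reinitializing due to auth failure") can be searched for in a broker log with a small script. The following is only a sketch: the function name `find_suspect_resignations` and the 5-second window are assumptions made for illustration, and pairing is done purely by timestamp order, so it is meant to be run on one broker's server.log at a time.

```python
import re
from datetime import datetime, timedelta

# Timestamp and level as they appear in the excerpts above.
LINE = re.compile(r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - (?P<level>[A-Z]+)\s")
# Matches both "Resigning" and "Resigned" controller messages.
RESIGN = re.compile(r"\[Controller id=(?P<id>\d+)\] Resign(?:ing|ed)")
AUTH_FAIL = "Reinitializing due to auth failure"


def find_suspect_resignations(lines, window=timedelta(seconds=5)):
    """Return (controller_id, timestamp) for each resignation logged
    within `window` after an auth-failure reinitialization."""
    last_auth_fail = None
    suspects = []
    for line in lines:
        m = LINE.search(line)
        if m is None:
            continue  # not a log line in the expected format
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        if AUTH_FAIL in line:
            last_auth_fail = ts
            continue
        r = RESIGN.search(line)
        if r and last_auth_fail is not None:
            if timedelta(0) <= ts - last_auth_fail <= window:
                suspects.append((int(r.group("id")), ts))
    return suspects


# The four lines from excerpt [2], copied verbatim.
SAMPLE = [
    "2022-02-17 05:38:35,760 - INFO  [zk-client-Kafkaserver-reinit-0:Logging@66] - [ZooKeeperClient Kafka server] Reinitializing due to auth failure.",
    "2022-02-17 05:38:35,762 - DEBUG [controller-event-thread:Logging@62] - [Controller id=5] Resigning",
    "2022-02-17 05:38:50,355 - INFO  [zk-client-Kafkaserver-reinit-0:Logging@66] - [ZooKeeperClient Kafka server] Reinitializing due to auth failure.",
    "2022-02-17 05:38:50,531 - INFO  [controller-event-thread:Logging@66] - [Controller id=1] Resigned",
]

for controller_id, ts in find_suspect_resignations(SAMPLE):
    print(controller_id, ts.time())
```

A match only indicates that the conditions for the issue were present. As noted in the description, the broker usually recovers by itself, so each hit still has to be checked against the controller state.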
      

      [3]
      https://issues.apache.org/jira/browse/KAFKA-13461

      Basically, when there is no JAAS configured for the ZK client and the ZK client tries to establish a new connection, the client will first receive an AUTH_FAIL event. However, this doesn't mean that the ZK client's session is gone, since the client will retry the connection without auth, which typically succeeds. Previously, we mistakenly tried to reinitialize the controller on the AUTH_FAIL event, which caused the controller to resign but not regain the controllership, since the underlying session was still valid.
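
The mechanism described in KAFKA-13461 can be illustrated with a toy model. This is not Kafka's code: `ToyBroker`, `ZkEvent` and the `reinit_on_auth_failed` flag are hypothetical names, and the model assumes that a broker stands for election again only once it has a new session.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ZkEvent(Enum):
    AUTH_FAILED = auto()  # first connection attempt without JAAS
    EXPIRED = auto()      # the session is really gone


@dataclass
class ToyBroker:
    reinit_on_auth_failed: bool  # True models the behaviour before the fix
    session_id: int = 1
    active_controller: bool = True

    def on_event(self, event: ZkEvent) -> None:
        if event is ZkEvent.EXPIRED:
            # Session lost: resign, open a new session, stand for election.
            self.active_controller = False
            self.session_id += 1
            self.active_controller = True  # toy model: sole candidate wins
        elif event is ZkEvent.AUTH_FAILED and self.reinit_on_auth_failed:
            # Resigns, but the session is unchanged, so nothing in this
            # model triggers a new election: the broker stays resigned.
            self.active_controller = False
        # Otherwise AUTH_FAILED is ignored; the client retries without auth.


before_fix = ToyBroker(reinit_on_auth_failed=True)
before_fix.on_event(ZkEvent.AUTH_FAILED)
print(before_fix.active_controller, before_fix.session_id)

after_fix = ToyBroker(reinit_on_auth_failed=False)
after_fix.on_event(ZkEvent.AUTH_FAILED)
print(after_fix.active_controller, after_fix.session_id)
```

In this model the broker that reinitializes on AUTH_FAILED ends up resigned while still holding its original session, and only a new session lets it take part in an election again. That is at least consistent with the observation that restarting the controller clears the condition.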

          People

            Unassigned Unassigned
            rhn-support-agagliar Antonio Gagliardi
            Lukas Kral Lukas Kral