- Bug
- Resolution: Done
- Critical
- 1.8.4.GA
A connection issue with the ZK nodes [1] appears to have caused Brokers 5 and 1 to try to reconnect and also to resign after the first auth failure [2]. That auth failure is expected because no auth has been configured. The concurrent resignation left Broker 1 in an inconsistent state. Even when these conditions are met, the issue occurs only occasionally; most of the time the broker recovers by itself. This appears to be KAFKA-13461 [3], which was fixed a few days ago. The workaround, restarting the controller, is also consistent with that recently fixed Kafka bug.
[1]
broker1-server.log:2022-02-17 05:08:49,801 - WARN [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
broker1-server.log:2022-02-17 05:08:49,803 - WARN [main-SendThread(zk-1-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
broker1-server.log:2022-02-17 05:08:54,802 - INFO [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
broker1-server.log:2022-02-17 05:08:54,802 - INFO [main-SendThread(zk-1-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
broker5-server.log:2022-02-17 05:36:29,861 - WARN [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
broker5-server.log:2022-02-17 05:36:29,862 - INFO [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
broker5-server.log:2022-02-17 05:38:33,909 - WARN [main-SendThread(zk-1-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
broker5-server.log:2022-02-17 05:38:33,909 - WARN [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
broker5-server.log:2022-02-17 05:38:33,909 - INFO [main-SendThread(zk-3-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
broker5-server.log:2022-02-17 05:38:33,909 - INFO [main-SendThread(zk-1-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
broker1-server.log:2022-02-17 05:38:49,062 - WARN [zk-client-Kafkaserver-reinit-0-SendThread(zk-2-kback:2181):ClientCnxn$SendThread@1190] - Client session timed out, have not heard from server
broker1-server.log:2022-02-17 05:38:49,062 - INFO [zk-client-Kafkaserver-reinit-0-SendThread(zk-2-kback:2181):ClientCnxn$SendThread@1238] - Client session timed out, have not heard from server
[2]
2022-02-17 05:38:35,760 - INFO [zk-client-Kafkaserver-reinit-0:Logging@66] - [ZooKeeperClient Kafka server] Reinitializing due to auth failure.
2022-02-17 05:38:35,762 - DEBUG [controller-event-thread:Logging@62] - [Controller id=5] Resigning
2022-02-17 05:38:50,355 - INFO [zk-client-Kafkaserver-reinit-0:Logging@66] - [ZooKeeperClient Kafka server] Reinitializing due to auth failure.
2022-02-17 05:38:50,531 - INFO [controller-event-thread:Logging@66] - [Controller id=1] Resigned
[3]
https://issues.apache.org/jira/browse/KAFKA-13461
Basically, when there is no JAAS configured for the ZK client and the ZK client tries to establish a new connection, the client will first receive an AUTH_FAIL event. However, this doesn't mean that the ZK client's session is gone, since the client will retry the connection without auth, which typically succeeds. Previously, we mistakenly tried to reinitialize the controller on the AUTH_FAIL event, which caused the controller to resign but not regain the controllership, since the underlying session was still valid.
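
The behaviour described in KAFKA-13461 can be illustrated with a small toy model. This is only a sketch and not Kafka's or ZooKeeper's actual code: `ZkEvent`, `ControllerModel`, `reinit_on_auth_fail` and `rejoins_election` are hypothetical names introduced for illustration, and the handling of session expiry (the broker takes part in a new election) is an assumption added for contrast.

```python
from enum import Enum, auto


class ZkEvent(Enum):
    """Hypothetical event names, loosely modelled on ZooKeeper client states."""
    AUTH_FAIL = auto()
    SESSION_EXPIRED = auto()


class ControllerModel:
    """Toy model of a broker that currently holds the controllership."""

    def __init__(self, reinit_on_auth_fail: bool) -> None:
        # True models the old behaviour described in the ticket,
        # False models the behaviour after the fix.
        self.reinit_on_auth_fail = reinit_on_auth_fail
        self.session_valid = True
        self.is_controller = True
        self.rejoins_election = False

    def handle(self, event: ZkEvent) -> None:
        if event is ZkEvent.SESSION_EXPIRED:
            # Assumption: the session is really gone, so the broker resigns
            # and takes part in a new election with a fresh session.
            self.session_valid = False
            self.is_controller = False
            self.rejoins_election = True
        elif event is ZkEvent.AUTH_FAIL and self.reinit_on_auth_fail:
            # Old behaviour: resign although the session is still valid,
            # so nothing prompts the broker to regain the controllership.
            self.is_controller = False
        # Otherwise AUTH_FAIL is ignored: the client retries without auth
        # and the existing session, and with it the controllership, is kept.


if __name__ == "__main__":
    for flag in (True, False):
        broker = ControllerModel(reinit_on_auth_fail=flag)
        broker.handle(ZkEvent.AUTH_FAIL)
        print(flag, broker.is_controller, broker.session_valid, broker.rejoins_election)
```

In this model the old behaviour leaves the broker with a valid session, no controllership, and no pending re-election, which matches the stuck state seen on Broker 1 and explains why restarting the controller clears it.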