-
Bug
-
Resolution: Done
-
Major
-
5.2.0.Final
-
Workaround Exists
-
When the coordinator sends a REBALANCE_START command, it holds a lock on the ClusterCacheStatus until it receives the responses from all the other members.
If a node doesn't need to request any new state, it sends the rebalance confirmation to the coordinator on the same thread that received the REBALANCE_START command. The REBALANCE_CONFIRM command also wants to acquire a lock on the ClusterCacheStatus on the coordinator, but because the REBALANCE_CONFIRM command is sent asynchronously, it doesn't deadlock with the thread waiting for REBALANCE_START responses on the coordinator.
At least, that's what happens when RSVP.ack_on_delivery=false (the Infinispan default). When RSVP.ack_on_delivery=true (the JGroups default), the "asynchronous" REBALANCE_CONFIRM command becomes synchronous, and it generates a deadlock. The rebalance then fails after the RSVP timeout expires (10 seconds by default).