-
Bug
-
Resolution: Done
-
Major
-
5.2.0.Final
-
None
The LEAVE command is sent asynchronously, so if the cache is restarted it is possible for the new JOIN command to be processed before the LEAVE command on the coordinator.
This doesn't work out very well: as the joining node is already present in the consistent hash during join, it won't do any state transfer. After that, it will receive a topology update with itself removed from the consistent hash.
I have seen one failure because of this in StateTransferFunctionalTest.testInitialStateTransferAfterRestart:
03:25:36,749 TRACE (testng-StateTransferFunctionalTest:) [JGroupsTransport] dests=[NodeG-42396], command=CacheTopologyControlCommand{cache=nbst, type=LEAVE, sender=NodeH-44562, joinInfo=null, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=1}, mode=ASYNCHRONOUS_WITH_SYNC_MARSHALLING, timeout=0 03:25:36,770 TRACE (testng-StateTransferFunctionalTest:) [JGroupsTransport] dests=[NodeG-42396], command=CacheTopologyControlCommand{cache=nbst, type=JOIN, sender=NodeH-44562, joinInfo=CacheJoinInfo{consistentHashFactory=org.infinispan.distribution.ch.ReplicatedConsistentHashFactory@335703e5, hashFunction=org.infinispan.commons.hash.MurmurHash3@64b6f0a5, numSegments=60, numOwners=2, timeout=240000}, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=1}, mode=SYNCHRONOUS, timeout=240000 03:25:36,771 TRACE (OOB-1,ISPN,NodeG-42396:) [CommandAwareRpcDispatcher] Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=nbst, type=JOIN, sender=NodeH-44562, joinInfo=CacheJoinInfo{consistentHashFactory=org.infinispan.distribution.ch.ReplicatedConsistentHashFactory@3aea6b42, hashFunction=org.infinispan.commons.hash.MurmurHash3@7427d845, numSegments=60, numOwners=2, timeout=240000}, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=1} [sender=NodeH-44562] 03:25:36,771 TRACE (testng-StateTransferFunctionalTest:) [StateTransferManagerImpl] Installing new cache topology CacheTopology{id=2, currentCH=ReplicatedConsistentHash{members=[NodeG-42396, NodeH-44562]}, pendingCH=null} on cache nbst 03:25:36,782 TRACE (OOB-2,ISPN,NodeG-42396:) [CommandAwareRpcDispatcher] Attempting to execute non-CacheRpcCommand command: CacheTopologyControlCommand{cache=nbst, type=LEAVE, sender=NodeH-44562, joinInfo=null, topologyId=0, currentCH=null, pendingCH=null, throwable=null, viewId=1} [sender=NodeH-44562] 03:25:36,840 TRACE (OOB-2,ISPN,NodeG-42396:nbst nbst) [StateTransferManagerImpl] Installing new cache topology CacheTopology{id=3, currentCH=ReplicatedConsistentHash{members=[NodeG-42396]}, pendingCH=null} on cache nbst 03:25:36,852 TRACE (OOB-2,ISPN,NodeH-44562:nbst) [StateTransferManagerImpl] Installing new cache topology CacheTopology{id=3, currentCH=ReplicatedConsistentHash{members=[NodeG-42396]}, pendingCH=null} on cache nbst
The solution is be to make the LEAVE command synchronous.