Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 5.3.0.Final
This causes random failures in ConcurrentOverlappingLeaveTest and ConcurrentNonOverlappingLeaveTest.
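In embedded-API terms, the failing scenario looks roughly like the sketch below. This is illustrative only: the real tests use the Infinispan test harness, kill members at controlled points to hit the race window reliably, and the cache and key names here are made up.

{code:java}
// Illustrative reproduction sketch, NOT the actual test code: four DIST_SYNC
// nodes, then two of them leave in quick succession while state transfer for
// the first leave is still in flight.
import java.util.ArrayList;
import java.util.List;

import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;

public class ConcurrentLeaveRepro {
   public static void main(String[] args) {
      ConfigurationBuilder cfg = new ConfigurationBuilder();
      cfg.clustering().cacheMode(CacheMode.DIST_SYNC).hash().numOwners(2);

      List<EmbeddedCacheManager> managers = new ArrayList<>();
      List<Cache<String, String>> caches = new ArrayList<>();
      for (int i = 0; i < 4; i++) {            // E, F, G, H
         EmbeddedCacheManager cm = new DefaultCacheManager(
               GlobalConfigurationBuilder.defaultClusteredBuilder().build());
         cm.defineConfiguration("dist", cfg.build());
         managers.add(cm);
         caches.add(cm.getCache("dist"));
      }

      Cache<String, String> cacheE = caches.get(0);
      for (int i = 0; i < 100; i++) {
         cacheE.put("k" + i, "v" + i);
      }

      // F leaves, then H leaves while the rebalance triggered by F's
      // departure is still running.
      managers.get(1).stop();   // F
      managers.get(3).stop();   // H

      // With the bug, entries that H owned can end up missing on E.
      for (int i = 0; i < 100; i++) {
         if (cacheE.get("k" + i) == null) {
            System.out.println("lost k" + i);
         }
      }
   }
}
{code}

Step by step: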
1. Starting with a 4-node cluster: [E, F, G, H] (topology 7).
2. F leaves, and E sends a REBALANCE_START command with nodes [E, G, H] (topology 8). Some segments are owned by [H] in the current CH and by [H, G] in the pending CH.
3. E reports that it has finished receiving state with a REBALANCE_CONFIRM command.
4. H leaves, and E sends a CH_UPDATE command with nodes [E, G] (topology 9).
The segments that were owned by [H] in the previous currentCH are assigned to [E, G] in the new currentCH (otherwise they wouldn't have any owners).
5. The StateConsumerImpl on E requests state for the "lost" segments from G.
6. G confirms the end of the rebalance as well, and E sends a CH_UPDATE command to end the rebalance (topology 10).
7. E sends a REBALANCE_START command to assign all segments for [E, G] (topology 11).
8. While the StateConsumerImpl on E is starting the state transfer, it also receives a StateResponseCommand for the lost segments from G.
9. Because the structures that keep track of the received state are not yet initialized for the new rebalance, E considers that it has finished receiving state for topology 11.
10. E then receives a StateResponseCommand from G with the actual data, but ignores it because StateConsumerImpl.updatedKeys == null (see the sketch after this list).
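The interleaving in steps 8-10 can be pictured with a simplified model of StateConsumerImpl. Only updatedKeys is a real field name; everything else below is a paraphrase of the bookkeeping, not the actual implementation:

{code:java}
// Sketch of the race: tracking structures for the new rebalance are set up
// on a transport thread while state responses arrive on a remote thread.
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

class StateConsumerSketch {
   // Keys written by state transfer; null means "no transfer in progress".
   private volatile Set<Object> updatedKeys;
   // Inbound segments still expected; stand-in for the real tracking.
   private final Set<Integer> inboundSegments = ConcurrentHashMap.newKeySet();

   // Step 7: REBALANCE_START for topology 11, on a transport thread.
   void startRebalance(Set<Integer> segmentsToReceive) {
      updatedKeys = ConcurrentHashMap.newKeySet();
      // <-- window: a StateResponseCommand from the topology-9 transfer can
      //     be processed on a remote thread before the next line runs
      inboundSegments.addAll(segmentsToReceive);
   }

   // Steps 8-10: StateResponseCommand, on a remote thread.
   void onStateResponse(Set<Integer> completedSegments, Map<Object, Object> chunk) {
      Set<Object> keys = updatedKeys;
      if (keys == null) {
         // Step 10: real topology-11 data lands here and is silently dropped,
         // so dc.get(key) on NodeE later returns null.
         return;
      }
      for (Map.Entry<Object, Object> e : chunk.entrySet()) {
         keys.add(e.getKey());
         // ... write the entry into the data container ...
      }
      inboundSegments.removeAll(completedSegments);
      if (inboundSegments.isEmpty()) {
         // Step 9: the stale topology-9 response finds the (still empty)
         // tracking for topology 11 and declares the rebalance finished.
         updatedKeys = null;
         // ... send REBALANCE_CONFIRM to the coordinator ...
      }
   }
}
{code}

In the log below, this is visible as "Finished receiving of segments for cache dist for topology 11" at 11:30:39,864, triggered while perform() is running on StateResponseCommand{..., topologyId=9}, before any topology-11 data has been applied.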
{noformat}
11:30:39,807 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=7, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=null}
11:30:39,810 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=8, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}}
11:30:39,817 DEBUG (transport-thread-3,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 8.
11:30:39,832 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=9, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
11:30:39,834 DEBUG (transport-thread-4,NodeE:dist) [StateConsumerImpl] Adding inbound state transfer for segments [38, 36, 47, 44, 45] of cache dist
11:30:39,853 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=11, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
11:30:39,859 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=9}
11:30:39,864 DEBUG (remote-thread-1,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 11.
11:30:39,866 TRACE (transport-thread-5,NodeE:dist) [LocalTopologyManagerImpl] Ignoring consistent hash update 10 for cache dist, we have already received a newer topology 11
11:30:39,868 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=11}
11:30:39,872 TRACE (remote-thread-1,NodeE:dist) [EntryWrappingInterceptor] State transfer will not write key/value MagicKey#k3{672f69c9@NodeG-6339}/v3 because it was already updated by somebody else
11:30:40,582 ERROR (testng-ConcurrentNonOverlappingLeaveTest:) [UnitTestTestNGListener] Test testTransactional(org.infinispan.distribution.rehash.ConcurrentNonOverlappingLeaveTest) failed.
java.lang.AssertionError: Fail on owner cache NodeE-51027: dc.get(MagicKey#k3{672f69c9@NodeG-6339}) returned null!
{noformat}
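For illustration, a guard along the following lines would close this particular window. This is purely a sketch; it does not imply that the actual fix in 5.3.0.Final took this shape, and all names below are hypothetical:

{code:java}
// Hypothetical guard (illustrative only): never let a response from an older
// transfer drain the completion accounting of a newer one.
import java.util.Map;
import java.util.Set;

class GuardedStateConsumerSketch {
   private int activeTransferTopologyId = -1;

   // Initialize all tracking under the same lock that responses take, so a
   // response can never observe a half-initialized rebalance.
   synchronized void startRebalance(int topologyId, Set<Integer> segments) {
      activeTransferTopologyId = topologyId;
      // ... initialize updatedKeys and the inbound-segment tracking here ...
   }

   synchronized void onStateResponse(int responseTopologyId,
                                     Set<Integer> completedSegments,
                                     Map<Object, Object> chunk) {
      if (responseTopologyId != activeTransferTopologyId) {
         // A topology-9 response must not complete the topology-11 rebalance;
         // account for it under its own transfer's bookkeeping instead.
         return;
      }
      // ... apply chunk, mark completedSegments, and confirm only when the
      //     transfer registered for this topology is really done ...
   }
}
{code}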