Infinispan / ISPN-3120: StateConsumerImpl can ignore state received during a rebalance



      This causes random failures in ConcurrentOverlappingLeaveTest and ConcurrentNonOverlappingLeaveTest.

      1. Starting with a 4-node cluster: [E, F, G, H] (topology 7).
      2. F leaves, and E sends a REBALANCE_START command with nodes [E, G, H] (topology 8). Some segments are owned by [H] in the current CH and by [H, G] in the pending CH.
      3. E reports that it has finished receiving state with a REBALANCE_CONFIRM command.
      4. H leaves, and E sends a CH_UPDATE command with nodes [E, G] (topology 9).
      The segments that were owned by [H] in the previous currentCH are assigned to [E, G] in the new currentCH (otherwise they wouldn't have any owners).
      5. The StateConsumerImpl on E requests state for the "lost" segments from G.
      6. G confirms the end of the rebalance as well, and E sends a CH_UPDATE command to end the rebalance (topology 10).
      7. E sends a REBALANCE_START command to assign all segments for [E, G] (topology 11).
      8. While the StateConsumerImpl on E is starting the state transfer, it also receives a StateResponseCommand for the lost segments from G.
      9. Because the structures keeping track of the received state are not properly initialized, E considers that it has finished receiving state for topology 11.
      10. E receives a StateResponseCommand from G with the actual data, but ignores it because StateConsumerImpl.updatedKeys == null (a simplified model of this is sketched after this list).
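
      To make steps 8-10 easier to follow, here is a deliberately simplified, self-contained Java model of the bookkeeping failure. SimplifiedStateConsumer, onTopologyUpdate and applyState are invented names for this sketch and are not the real StateConsumerImpl API; the only behaviour taken from the report is that a per-rebalance tracking structure (called updatedKeys here, as in the report) is discarded when the transfer is believed to be complete, after which any further state response is silently dropped.

      // Simplified model of steps 8-10; not the actual Infinispan code.
      import java.util.Map;
      import java.util.Set;
      import java.util.concurrent.ConcurrentHashMap;

      public class SimplifiedStateConsumer {

          private volatile int cacheTopologyId;
          // Tracks the keys written by state transfer for the rebalance in progress.
          // null means "no rebalance in progress", so incoming state is ignored.
          private volatile Set<Object> updatedKeys;

          private final Map<Object, Object> dataContainer = new ConcurrentHashMap<>();

          // REBALANCE_START: a rebalance begins for the given topology.
          public synchronized void onTopologyUpdate(int topologyId, boolean isRebalance) {
              cacheTopologyId = topologyId;
              if (isRebalance) {
                  updatedKeys = ConcurrentHashMap.newKeySet();
              }
          }

          // Called once the node believes it has received all requested segments.
          private synchronized void notifyEndOfRebalance(int topologyId) {
              System.out.printf("Finished receiving segments for topology %d%n", topologyId);
              // The per-rebalance tracking is discarded here ...
              updatedKeys = null;
          }

          // Handler for an incoming state response, modelled as a map of entries.
          public void applyState(int senderTopologyId, Map<Object, Object> chunk, boolean lastChunk) {
              Set<Object> keys = updatedKeys;
              if (keys == null) {
                  // ... so data arriving after the (premature) end of the rebalance is
                  // silently dropped: this is the lost write from step 10.
                  System.out.printf("Ignoring state for topology %d, no rebalance in progress%n",
                        senderTopologyId);
                  return;
              }
              for (Map.Entry<Object, Object> e : chunk.entrySet()) {
                  dataContainer.put(e.getKey(), e.getValue());
                  keys.add(e.getKey());
              }
              if (lastChunk) {
                  notifyEndOfRebalance(cacheTopologyId);
              }
          }

          public static void main(String[] args) {
              SimplifiedStateConsumer nodeE = new SimplifiedStateConsumer();
              nodeE.onTopologyUpdate(11, true);                 // step 7: REBALANCE_START, topology 11
              nodeE.applyState(9, Map.of(), true);              // step 8: stale, empty response from topology 9
              nodeE.applyState(11, Map.of("k3", "v3"), true);   // step 10: real data, dropped
              System.out.println("k3 present? " + nodeE.dataContainer.containsKey("k3")); // false
          }
      }

      Running main reproduces the shape of the failure: the stale response is treated as completing the topology 11 transfer, and the real data chunk that arrives afterwards never reaches the data container, matching the null dc.get(MagicKey#k3) assertion in the log below.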

      11:30:39,807 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=7, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=null}
      11:30:39,810 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=8, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339, NodeH-47370]}}
      11:30:39,817 DEBUG (transport-thread-3,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 8.
      11:30:39,832 DEBUG (transport-thread-4,NodeE:dist) [LocalTopologyManagerImpl] Updating local consistent hash(es) for cache dist: new topology = CacheTopology{id=9, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
      11:30:39,834 DEBUG (transport-thread-4,NodeE:dist) [StateConsumerImpl] Adding inbound state transfer for segments [38, 36, 47, 44, 45] of cache dist
      11:30:39,853 DEBUG (transport-thread-3,NodeE:dist) [LocalTopologyManagerImpl] Starting local rebalance for cache dist, topology = CacheTopology{id=11, currentCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}, pendingCH=DefaultConsistentHash{numSegments=60, numOwners=2, members=[NodeE-51027, NodeG-6339]}}
      11:30:39,859 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=9}
      11:30:39,864 DEBUG (remote-thread-1,NodeE:dist) [StateConsumerImpl] Finished receiving of segments for cache dist for topology 11.
      11:30:39,866 TRACE (transport-thread-5,NodeE:dist) [LocalTopologyManagerImpl] Ignoring consistent hash update 10 for cache dist, we have already received a newer topology 11
      11:30:39,868 TRACE (remote-thread-1,NodeE:) [InboundInvocationHandlerImpl] Calling perform() on StateResponseCommand{cache=dist, origin=NodeG-6339, topologyId=11}
      11:30:39,872 TRACE (remote-thread-1,NodeE:dist dist) [EntryWrappingInterceptor] State transfer will not write key/value MagicKey#k3{672f69c9@NodeG-6339}/v3 because it was already updated by somebody else
      11:30:40,582 ERROR (testng-ConcurrentNonOverlappingLeaveTest:) [UnitTestTestNGListener] Test testTransactional(org.infinispan.distribution.rehash.ConcurrentNonOverlappingLeaveTest) failed.
      java.lang.AssertionError: Fail on owner cache NodeE-51027: dc.get(MagicKey#k3{672f69c9@NodeG-6339}) returned null!
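
      The "Ignoring consistent hash update 10 ... we have already received a newer topology 11" line above shows that topology ids are already compared before applying CH updates. The sketch below illustrates the same kind of stale-topology guard applied to incoming state responses; StaleStateResponseGuard and shouldApply are hypothetical names for this illustration only, not the actual ISPN-3120 fix.

      // Hypothetical guard: drop state responses tagged with an older topology id
      // than the rebalance currently being tracked. Illustration only.
      public final class StaleStateResponseGuard {

          private StaleStateResponseGuard() {
          }

          // responseTopologyId: topology id carried by the incoming state response.
          // transferTopologyId: topology id of the rebalance currently in progress.
          public static boolean shouldApply(int responseTopologyId, int transferTopologyId) {
              if (responseTopologyId < transferTopologyId) {
                  System.out.printf("Ignoring state response for topology %d, already at topology %d%n",
                        responseTopologyId, transferTopologyId);
                  return false;
              }
              return true;
          }

          public static void main(String[] args) {
              // The sequence from the log: a response for topology 9 arrives while
              // the node is rebalancing for topology 11.
              System.out.println(shouldApply(9, 11));   // false: stale, must not count as received state
              System.out.println(shouldApply(11, 11));  // true: belongs to the current rebalance
          }
      }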
      
