Infinispan / ISPN-13147

State transfer can lose entries when a node leaves

      When a node starts state transfer, it enables state transfer key tracking before requesting data from the current owners.
      While key tracking is disabled, writes with the PUT_FOR_STATE_TRANSFER flag are ignored.
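The gating rule above can be modelled roughly as follows. This is a minimal sketch, not Infinispan's actual CommitManager: the class `KeyTrackingSketch` and its methods are hypothetical names chosen for illustration, and the model assumes that a key tracked for a regular write must not be overwritten by a later state transfer write.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of key tracking (not the real CommitManager).
class KeyTrackingSketch {
    private final Map<String, String> data = new HashMap<>();
    private final Set<String> keysWrittenRegularly = new HashSet<>();
    private boolean tracking;

    // Called when the node starts state transfer, before requesting data.
    void startTrack() {
        tracking = true;
    }

    // Regular write: always applied, and remembered while tracking is enabled.
    void putRegular(String key, String value) {
        data.put(key, value);
        if (tracking) {
            keysWrittenRegularly.add(key);
        }
    }

    // State transfer write: returns true only if the value was applied.
    boolean putForStateTransfer(String key, String value) {
        if (!tracking) {
            return false; // tracking disabled: the write is ignored
        }
        if (keysWrittenRegularly.contains(key)) {
            return false; // a newer regular write already exists
        }
        data.put(key, value);
        return true;
    }

    String get(String key) {
        return data.get(key);
    }

    public static void main(String[] args) {
        KeyTrackingSketch node = new KeyTrackingSketch();
        System.out.println(node.putForStateTransfer("k", "old")); // false
        node.startTrack();
        node.putRegular("k", "new");
        System.out.println(node.putForStateTransfer("k", "old")); // false
        System.out.println(node.get("k"));                        // new
    }
}
```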

      This causes a problem when another node leaves during state transfer.
      The coordinator first sends a ConsistentHashUpdateCommand that only removes the leaver, but then immediately sends a RebalanceStartCommand that may add more segments.

      StateConsumerImpl receives the topology update with isRebalance==true and disables key tracking for a short while, before enabling it again.
      Any entries that are supposed to be inserted by state transfer in the short interval while key tracking is disabled will be ignored.
      Disabling key tracking also clears the keys already tracked for regular writes, so older values from state transfer can overwrite newer values from regular writes.
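Both effects of the disable/enable toggle can be reproduced with a single-threaded model that replays the interleaving step by step. This is only a sketch under assumptions: `TrackingResetSketch` and its methods are hypothetical and do not mirror the real StateConsumerImpl or CommitManager code, and key 32 is used only because it is the key from the failing run.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the tracking reset on a rebalance topology update.
class TrackingResetSketch {
    private final Map<Integer, String> data = new HashMap<>();
    private final Set<Integer> trackedKeys = new HashSet<>();
    private boolean tracking;

    void startTrack() {
        tracking = true;
    }

    // Disabling tracking also forgets which keys had regular writes.
    void stopTrack() {
        tracking = false;
        trackedKeys.clear();
    }

    void putRegular(int key, String value) {
        data.put(key, value);
        if (tracking) {
            trackedKeys.add(key);
        }
    }

    boolean putForStateTransfer(int key, String value) {
        if (!tracking || trackedKeys.contains(key)) {
            return false;
        }
        data.put(key, value);
        return true;
    }

    String get(int key) {
        return data.get(key);
    }

    // Effect 1: a state chunk is applied between stopTrack() and startTrack().
    static String lostEntry() {
        TrackingResetSketch node = new TrackingResetSketch();
        node.startTrack();                  // initial rebalance
        node.stopTrack();                   // new topology, isRebalance == true
        node.putForStateTransfer(32, "32"); // arrives in the gap and is ignored
        node.startTrack();
        return node.get(32);                // the entry was never stored
    }

    // Effect 2: the reset forgets a regular write, so stale state wins.
    static String staleOverwrite() {
        TrackingResetSketch node = new TrackingResetSketch();
        node.startTrack();
        node.putRegular(32, "new");          // tracked regular write
        node.stopTrack();                    // reset clears the tracked keys
        node.startTrack();
        node.putForStateTransfer(32, "old"); // no longer blocked
        return node.get(32);
    }

    public static void main(String[] args) {
        System.out.println("lost entry: " + lostEntry());           // null
        System.out.println("stale overwrite: " + staleOverwrite()); // old
    }
}
```

In the real failure the gap is a race between the topology update thread and the thread applying the state chunk, which is why the test only fails intermittently; the model above fixes the ordering to make the outcome deterministic.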

      This causes random failures in StateTransferRestart2Test.testStateTransferRestart:

      14:26:02,776 TRACE (non-blocking-thread-Test-NodeC-p13822-t5:[]) [StateConsumerImpl] Received new topology for cache defaultcache, isRebalance = true, isMember = true, topology = CacheTopology{id=6, phase=READ_OLD_WRITE_ALL, rebalanceId=3, currentCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeA: 131+125, Test-NodeB: 125+131]}, pendingCH=DefaultConsistentHash{ns=256, owners = (3)[Test-NodeA: 87+79, Test-NodeB: 83+89, Test-NodeC: 86+88]}, unionCH=DefaultConsistentHash{ns=256, owners = (3)[Test-NodeA: 131+125, Test-NodeB: 125+131, Test-NodeC: 0+174]}, actualMembers=[Test-NodeA, Test-NodeB, Test-NodeC], persistentUUIDs=[8e4c4778-f8fc-43b7-b3a5-d5900ecfc4f2, 793631fe-0bb9-465d-ad1d-3db483652dcc, 62b2f37f-01bf-4e8c-bee6-1562dbb2a679]}
      14:26:02,776 TRACE (non-blocking-thread-Test-NodeC-p13822-t5:[]) [CommitManager] Set track to PUT_FOR_STATE_TRANSFER = false
      14:26:02,776 TRACE (non-blocking-thread-Test-NodeC-p13822-t5:[]) [CommitManager] Set track to PUT_FOR_STATE_TRANSFER = true
      14:26:02,778 TRACE (jgroups-11,Test-NodeC:[]) [StateConsumerImpl] Adding transfer from Test-NodeA for segments {4-18 66 68 72-74 76-77 82-96 137-146 149-161 204-211 213-219}
      
      14:26:07,212 TRACE (non-blocking-thread-Test-NodeC-p13822-t3:[]) [StateConsumerImpl] Received new topology for cache defaultcache, isRebalance = false, isMember = true, topology = CacheTopology{id=7, phase=READ_OLD_WRITE_ALL, rebalanceId=3, currentCH=DefaultConsistentHash{ns=256, owners = (1)[Test-NodeA: 256+0]}, pendingCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeA: 129+37, Test-NodeC: 127+47]}, unionCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeA: 256+0, Test-NodeC: 0+174]}, actualMembers=[Test-NodeA, Test-NodeC], persistentUUIDs=[8e4c4778-f8fc-43b7-b3a5-d5900ecfc4f2, 62b2f37f-01bf-4e8c-bee6-1562dbb2a679]}
      14:26:07,220 TRACE (jgroups-8,Test-NodeC:[]) [StateConsumerImpl] Topology update processed, stateTransferTopologyId = 6, startRebalance = false, pending CH = DefaultConsistentHash{ns=256, owners = (2)[Test-NodeA: 129+37, Test-NodeC: 127+47]}
      
      14:26:07,224 TRACE (jgroups-11,Test-NodeC:[defaultcache]) [StateConsumerImpl] Applying new state chunk for segment 137 of cache defaultcache from node Test-NodeA: received 1 cache entries
      
      14:26:07,225 TRACE (non-blocking-thread-Test-NodeC-p13822-t6:[]) [StateConsumerImpl] Received new topology for cache defaultcache, isRebalance = true, isMember = true, topology = CacheTopology{id=8, phase=READ_OLD_WRITE_ALL, rebalanceId=4, currentCH=DefaultConsistentHash{ns=256, owners = (1)[Test-NodeA: 256+0]}, pendingCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeA: 131+125, Test-NodeC: 125+131]}, unionCH=DefaultConsistentHash{ns=256, owners = (2)[Test-NodeA: 256+0, Test-NodeC: 0+256]}, actualMembers=[Test-NodeA, Test-NodeC], persistentUUIDs=[8e4c4778-f8fc-43b7-b3a5-d5900ecfc4f2, 62b2f37f-01bf-4e8c-bee6-1562dbb2a679]}
      14:26:07,229 TRACE (jgroups-8,Test-NodeC:[]) [StateConsumerImpl] Topology update processed, stateTransferTopologyId = 6, startRebalance = true, pending CH = DefaultConsistentHash{ns=256, owners = (2)[Test-NodeA: 131+125, Test-NodeC: 125+131]}
      
      14:26:07,226 TRACE (jgroups-11,Test-NodeC:[defaultcache]) [LocalTransaction] Adding modification PutKeyValueCommand{key=32, value=32, flags=[CACHE_MODE_LOCAL, SKIP_LOCKING, SKIP_REMOTE_LOOKUP, PUT_FOR_STATE_TRANSFER, SKIP_SHARED_CACHE_STORE, SKIP_OWNERSHIP_CHECK, IGNORE_RETURN_VALUES, SKIP_XSITE_BACKUP, IRAC_STATE], commandInvocationId=, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=InternalMetadataImpl{actual=EmbeddedMetadata{version=null}, created=-1, lastUsed=-1}, internalMetadata=PrivateMetadata{iracMetadata=null, entryVersion=SimpleClusteredVersion{topologyId=5, version=1}}, successful=true, topologyId=7}. Mod list is null
      14:26:07,226 TRACE (jgroups-11,Test-NodeC:[defaultcache]) [InvocationContextInterceptor] Invoked with command VersionedPrepareCommand {modifications=null, onePhaseCommit=true, retried=false, versionsSeen=null, gtx=GlobalTransaction{id=9656, addr=Test-NodeC, remote=false, xid=null, internalId=-1}, cacheName='defaultcache'} and InvocationContext [org.infinispan.context.impl.LocalTxInvocationContext@79343c46]
      14:26:07,226 TRACE (non-blocking-thread-Test-NodeC-p13822-t6:[]) [StateConsumerImpl] Start keeping track of keys for rebalance
      14:26:07,226 TRACE (non-blocking-thread-Test-NodeC-p13822-t6:[]) [CommitManager] Set track to PUT_FOR_STATE_TRANSFER = false
      14:26:07,227 TRACE (non-blocking-thread-Test-NodeC-p13822-t6:[]) [CommitManager] Set track to PUT_FOR_STATE_TRANSFER = true
      14:26:07,227 TRACE (jgroups-11,Test-NodeC:[defaultcache]) [CommitManager] Not committing key=32. It is a state transfer key but no track is enabled!
      
      14:26:37,706 ERROR (testng-Test:[]) [TestSuiteProgress] Test failed: org.infinispan.statetransfer.StateTransferRestart2Test.testStateTransferRestart
      java.lang.AssertionError: expected:<100>, got:<99>
      	at org.testng.AssertJUnit.fail(AssertJUnit.java:59) ~[testng-6.14.3.jar:?]
      	at org.infinispan.test.AbstractInfinispanTest.eventually(AbstractInfinispanTest.java:237) ~[test-classes/:?]
      	at org.infinispan.test.AbstractInfinispanTest.eventually(AbstractInfinispanTest.java:216) ~[test-classes/:?]
      	at org.infinispan.test.AbstractInfinispanTest.eventuallyEquals(AbstractInfinispanTest.java:206) ~[test-classes/:?]
      	at org.infinispan.statetransfer.StateTransferRestart2Test.testStateTransferRestart(StateTransferRestart2Test.java:99) ~[test-classes/:?]
      

              Assignee: Unassigned
              Reporter: Dan Berindei (Inactive), dberinde@redhat.com