-
Bug
-
Resolution: Done
-
Critical
-
7.2.1.Final
Since ISPN-4198 was fixed, joiners ignore commands that were sent in a topology in which they were not members. But the joiner will still send a null response, which is valid, so the tx originator will not retry the command.
The owner may have already sent the list of transactions to the joiner before the prepare, so this sequence of events can happen:
1. A new topology is installed, which includes the joiner (B) in the write consistent hash. B requests the transactions from A, but doesn't receive anything because the affected keys of GlobalTransaction:<NodeA-40680>:99974:local haven't been updated yet.
19:14:11,027 TRACE (remote-thread-NodeB-p35288-t2:) [StateConsumerImpl] Received new topology for cache resultCache2, isRebalance = false, isMember = false, topology = CacheTopology{id=0, rebalanceId=0, currentCH=ReplicatedConsistentHash{ns = 1, owners = (1)[NodeA-40680: 1]}, pendingCH=null, unionCH=null, actualMembers=[NodeA-40680]} 19:14:11,039 TRACE (transport-thread-NodeA-p35281-t6:) [StateTransferInterceptor] handleTxCommand for command PrepareCommand {modifications=[PutKeyValueCommand{key=rules, value=1, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true}], onePhaseCommit=false, gtx=GlobalTransaction:<NodeA-40680>:99974:local, cacheName='resultCache2', topologyId=-1}, origin null 19:14:11,044 TRACE (transport-thread-NodeA-p35281-t1:) [StateConsumerImpl] Received new topology for cache resultCache2, isRebalance = true, isMember = true, topology = CacheTopology{id=1, rebalanceId=1, currentCH=ReplicatedConsistentHash{ns = 1, owners = (1)[NodeA-40680: 1]}, pendingCH=ReplicatedConsistentHash{ns = 1, owners = (2)[NodeA-40680: 1, NodeB-64486: 0]}, unionCH=ReplicatedConsistentHash{ns = 1, owners = (2)[NodeA-40680: 1, NodeB-64486: 0]}, actualMembers=[NodeA-40680, NodeB-64486]} 19:14:11,045 TRACE (remote-thread-NodeA-p35282-t4:) [StateProviderImpl] Received request for transactions from node NodeB-64486 for segments [0] of cache resultCache2 with topology id 1 19:14:11,045 TRACE (remote-thread-NodeA-p35282-t4:) [StateProviderImpl] Skipping transaction NodeB-64486 because the state requestor doesn't own any key 19:14:11,045 TRACE (remote-thread-NodeA-p35282-t4:) [CommandAwareRpcDispatcher] About to send back response SuccessfulResponse{responseValue=[]} for command StateRequestCommand{cache=resultCache2, origin=NodeB-64486, type=GET_TRANSACTIONS, topologyId=1, segments=[0]} 19:14:11,045 DEBUG (transport-thread-NodeB-p35287-t3:) [StateConsumerImpl] Applying 0 transactions for cache resultCache2 transferred from node NodeA-40680
2. Node A adds the affected key and sends the prepare command to B with topology 0. B ignores the command.
19:14:11,039 TRACE (transport-thread-NodeA-p35281-t6:) [AbstractCacheTransaction] Registering locked key: rules 19:14:11,047 TRACE (transport-thread-NodeA-p35281-t6:) [JGroupsTransport] dests=null, command=PrepareCommand {modifications=[PutKeyValueCommand{key=rules, value=1, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true}], onePhaseCommit=false, gtx=GlobalTransaction:<NodeA-40680>:99974:local, cacheName='resultCache2', topologyId=0}, mode=SYNCHRONOUS_IGNORE_LEAVERS, timeout=15000 19:14:11,048 TRACE (OOB-4,NodeB-64486:) [GlobalInboundInvocationHandler] Attempting to execute CacheRpcCommand: PrepareCommand {modifications=[PutKeyValueCommand{key=rules, value=1, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true}], onePhaseCommit=false, gtx=GlobalTransaction:<NodeA-40680>:99974:local, cacheName='resultCache2', topologyId=0} [sender=NodeA-40680] 19:14:11,048 TRACE (remote-thread-NodeB-p35288-t6:) [NonTotalOrderPerCacheInboundInvocationHandler] Ignoring command sent before the local node was a member (command topology id is 0) 19:14:11,048 TRACE (remote-thread-NodeB-p35288-t6:) [CommandAwareRpcDispatcher] About to send back response null for command PrepareCommand {modifications=[PutKeyValueCommand{key=rules, value=1, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true}], onePhaseCommit=false, gtx=GlobalTransaction:<NodeA-40680>:99974:remote, cacheName='resultCache2', topologyId=0} 19:14:11,058 TRACE (transport-thread-NodeA-p35281-t6:) [RpcManagerImpl] Response(s) to PrepareCommand {modifications=[PutKeyValueCommand{key=rules, value=1, flags=null, putIfAbsent=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedMetadata{version=null}, successful=true}], onePhaseCommit=false, gtx=GlobalTransaction:<NodeA-40680>:99974:local, cacheName='resultCache2', topologyId=0} is {}
3. A tries to commit the tx, but B throws an exception because it can't find the remote tx.
19:14:11,066 TRACE (transport-thread-NodeA-p35281-t6:) [LocalTransaction] Adding remote locks on [NodeA-40680, NodeB-64486]. Remote locks are null 19:14:11,066 TRACE (transport-thread-NodeA-p35281-t6:) [TransactionCoordinator] Committing transaction GlobalTransaction:<NodeA-40680>:99974:local 19:14:11,066 TRACE (transport-thread-NodeA-p35281-t6:) [RpcManagerImpl] NodeA-40680 invoking CommitCommand {gtx=GlobalTransaction:<NodeA-40680>:99974:local, cacheName='resultCache2', topologyId=2} to recipient list null with options RpcOptions{timeout=15000, unit=MILLISECONDS, deliverOrder=NONE, responseFilter=null, responseMode=SYNCHRONOUS_IGNORE_LEAVERS, skipReplicationQueue=false} 19:14:11,066 TRACE (OOB-5,NodeB-64486:) [GlobalInboundInvocationHandler] Attempting to execute CacheRpcCommand: CommitCommand {gtx=GlobalTransaction:<NodeA-40680>:99974:local, cacheName='resultCache2', topologyId=2} [sender=NodeA-40680] 19:14:11,066 TRACE (remote-thread-NodeB-p35288-t6:) [NonTotalOrderPerCacheInboundInvocationHandler] Calling perform() on CommitCommand {gtx=GlobalTransaction:<NodeA-40680>:99974:remote, cacheName='resultCache2', topologyId=2} 19:14:11,066 TRACE (remote-thread-NodeB-p35288-t6:) [AbstractTransactionBoundaryCommand] Did not find a RemoteTransaction for GlobalTransaction:<NodeA-40680>:99974:remote 19:14:11,067 WARN (remote-thread-NodeB-p35288-t6:) [NonTotalOrderPerCacheInboundInvocationHandler] ISPN000071: Caught exception when handling command CommitCommand {gtx=GlobalTransaction:<NodeA-40680>:99974:remote, cacheName='resultCache2', topologyId=2} java.lang.IllegalStateException: Remote transaction not found: GlobalTransaction:<NodeA-40680>:99974:remote at org.infinispan.commands.tx.CommitCommand.invalidRemoteTxReturnValue(CommitCommand.java:54) at org.infinispan.commands.tx.AbstractTransactionBoundaryCommand.perform(AbstractTransactionBoundaryCommand.java:89) at org.infinispan.remoting.inboundhandler.BasePerCacheInboundInvocationHandler.invokePerform(BasePerCacheInboundInvocationHandler.java:84) at org.infinispan.remoting.inboundhandler.BaseBlockingRunnable.run(BaseBlockingRunnable.java:31)
This is causing random failures in the map/reduce tests (e.g. DistributedSharedCacheTwoNodesMapReduceTest), because map/reduce doesn't wait for all the nodes to join before inserting in the cache.