JBoss Enterprise Application Platform / JBEAP-3936

Deadlock during synchronization with replicated journal


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Blocker
    • Fix Version/s: None
    • Affects Version/s: 7.0.0.ER7
    • Component/s: ActiveMQ
    • Labels: None
    • Regression

      Steps to reproduce the issue (not a 100% reliable reproducer):

      git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
      cd eap-tests-hornetq/scripts/
      git checkout refactoring_modules
      groovy -DEAP_VERSION=7.0.0.ER7 PrepareServers7.groovy
      export WORKSPACE=$PWD
      export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
      export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
      export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
      export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
      cd ../jboss-hornetq-testsuite/
      mvn clean test -Dtest=JournalReplicationNioBlockTestCase#journalReplicationWithoutNetworkProblemTest  -DfailIfNoTests=false -Deap=7x   | tee log
      

      A regression against EAP 7.0.0.ER6 was hit in a scenario where the backup was synchronizing with the live server. Two EAP 7 servers are configured in a dedicated topology with a replicated journal. During synchronization, a deadlock occurs on the live server. The full thread dump from the live server is attached.

      The deadlock was detected by "jstack" and points to the following threads:

      Java stack information for the threads listed above:
      ===================================================
      "Thread-101":
      	at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.isWritable(NettyConnection.java:106)
      	- waiting to lock <0x00000000fee15ca8> (a java.util.concurrent.LinkedBlockingDeque)
      	at org.apache.activemq.artemis.spi.core.protocol.AbstractRemotingConnection.isWritable(AbstractRemotingConnection.java:55)
      	at org.apache.activemq.artemis.core.replication.ReplicationManager.sendReplicatePacket(ReplicationManager.java:345)
      	- locked <0x00000000fee16168> (a java.lang.Object)
      	at org.apache.activemq.artemis.core.replication.ReplicationManager.sendReplicatePacket(ReplicationManager.java:329)
      	at org.apache.activemq.artemis.core.replication.ReplicationManager.sendLargeFile(ReplicationManager.java:540)
      	at org.apache.activemq.artemis.core.replication.ReplicationManager.syncLargeMessageFile(ReplicationManager.java:485)
      	at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.sendLargeMessageFiles(JournalStorageManager.java:521)
      	at org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager.startReplication(JournalStorageManager.java:384)
      	at org.apache.activemq.artemis.core.server.impl.SharedNothingLiveActivation$2.run(SharedNothingLiveActivation.java:160)
      	at java.lang.Thread.run(Thread.java:745)
      "default I/O-15":
      	at org.apache.activemq.artemis.core.replication.ReplicationManager.readyForWriting(ReplicationManager.java:380)
      	- waiting to lock <0x00000000fee16168> (a java.lang.Object)
      	at org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection.fireReady(NettyConnection.java:126)
      	- locked <0x00000000fee15ca8> (a java.util.concurrent.LinkedBlockingDeque)
      	at org.apache.activemq.artemis.core.remoting.impl.netty.NettyAcceptor$Listener.connectionReadyForWrites(NettyAcceptor.java:676)
      	at org.apache.activemq.artemis.core.remoting.impl.netty.ActiveMQChannelHandler.channelWritabilityChanged(ActiveMQChannelHandler.java:61)
      	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelWritabilityChanged(AbstractChannelHandlerContext.java:366)
      	at io.netty.channel.AbstractChannelHandlerContext.fireChannelWritabilityChanged(AbstractChannelHandlerContext.java:348)
      	at io.netty.channel.ChannelInboundHandlerAdapter.channelWritabilityChanged(ChannelInboundHandlerAdapter.java:119)
      	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelWritabilityChanged(AbstractChannelHandlerContext.java:366)
      	at io.netty.channel.AbstractChannelHandlerContext.fireChannelWritabilityChanged(AbstractChannelHandlerContext.java:348)
      	at io.netty.channel.DefaultChannelPipeline.fireChannelWritabilityChanged(DefaultChannelPipeline.java:861)
      	at io.netty.channel.ChannelOutboundBuffer.fireChannelWritabilityChanged(ChannelOutboundBuffer.java:589)
      	at io.netty.channel.ChannelOutboundBuffer.setWritable(ChannelOutboundBuffer.java:555)
      	at io.netty.channel.ChannelOutboundBuffer.decrementPendingOutboundBytes(ChannelOutboundBuffer.java:198)
      	at io.netty.channel.ChannelOutboundBuffer.remove(ChannelOutboundBuffer.java:263)
      	at org.xnio.netty.transport.AbstractXnioSocketChannel.doWrite(AbstractXnioSocketChannel.java:174)
      	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush0(AbstractChannel.java:765)
      	at org.xnio.netty.transport.AbstractXnioSocketChannel$AbstractXnioUnsafe.flush0(AbstractXnioSocketChannel.java:363)
      	at io.netty.channel.AbstractChannel$AbstractUnsafe.flush(AbstractChannel.java:733)
      	at io.netty.channel.DefaultChannelPipeline$HeadContext.flush(DefaultChannelPipeline.java:1127)
      	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663)
      	at io.netty.channel.AbstractChannelHandlerContext.flush(AbstractChannelHandlerContext.java:644)
      	at io.netty.channel.ChannelDuplexHandler.flush(ChannelDuplexHandler.java:117)
      	at io.netty.channel.AbstractChannelHandlerContext.invokeFlush(AbstractChannelHandlerContext.java:663)
      	at io.netty.channel.AbstractChannelHandlerContext.access$1500(AbstractChannelHandlerContext.java:32)
      	at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:961)
      	at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:893)
      	at org.xnio.nio.WorkerThread.safeRun(WorkerThread.java:580)
      	at org.xnio.nio.WorkerThread.run(WorkerThread.java:464)
      
      Found 1 deadlock.
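
      The two stacks show a lock-ordering inversion: "Thread-101" holds the ReplicationManager lock (0x00000000fee16168) in sendReplicatePacket and waits for the NettyConnection deque monitor (0x00000000fee15ca8) in isWritable, while "default I/O-15" holds the deque monitor inside fireReady and waits for the ReplicationManager lock in readyForWriting. A minimal, self-contained sketch of the same pattern follows; the class and lock names are simplified stand-ins, not the actual Artemis fields:

      // Simplified stand-in for the inverted lock order seen in the dump above:
      // "replicationLock" plays the role of the Object locked in sendReplicatePacket(),
      // "writabilityLock" the LinkedBlockingDeque monitor inside NettyConnection.
      public class LockOrderInversionSketch {

          private final Object replicationLock = new Object();
          private final Object writabilityLock = new Object();

          // Path of "Thread-101": sendReplicatePacket() -> isWritable()
          void sendReplicatePacket() {
              synchronized (replicationLock) {
                  pause(); // widen the race window so the deadlock reproduces reliably
                  synchronized (writabilityLock) {
                      // check writability / register a ready callback
                  }
              }
          }

          // Path of "default I/O-15": fireReady() -> readyForWriting()
          void channelWritabilityChanged() {
              synchronized (writabilityLock) {
                  pause();
                  synchronized (replicationLock) {
                      // notify waiters that the channel is writable again
                  }
              }
          }

          private static void pause() {
              try {
                  Thread.sleep(100);
              } catch (InterruptedException e) {
                  Thread.currentThread().interrupt();
              }
          }

          public static void main(String[] args) {
              LockOrderInversionSketch s = new LockOrderInversionSketch();
              new Thread(s::sendReplicatePacket, "Thread-101").start();
              new Thread(s::channelWritabilityChanged, "default I/O-15").start();
          }
      }

      Running the sketch, both threads acquire their first monitor, then block on the other one in the opposite order, which is the same hang state captured in the dump above.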
      

      Customer impact:
      The deadlock prevents all JMS clients from sending or receiving messages. The EAP 7 server configured as live cannot be shut down cleanly and must be killed.
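
      For monitoring purposes, the same condition that "jstack" reports can also be detected with the standard ThreadMXBean API when run inside (or attached to) the affected JVM; a minimal sketch, independent of Artemis internals:

      import java.lang.management.ManagementFactory;
      import java.lang.management.ThreadInfo;
      import java.lang.management.ThreadMXBean;

      // Reports deadlocked threads, similar to the jstack output above.
      public class DeadlockCheck {
          public static void main(String[] args) {
              ThreadMXBean mx = ManagementFactory.getThreadMXBean();
              long[] ids = mx.findDeadlockedThreads(); // null when no deadlock exists
              if (ids == null) {
                  System.out.println("No deadlock detected.");
                  return;
              }
              for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
                  System.out.printf("\"%s\" is blocked on %s held by \"%s\"%n",
                          info.getThreadName(), info.getLockName(), info.getLockOwnerName());
              }
          }
      }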

      Note: There is another issue with the replicated journal - JBEAP-3900 "Split Brain issue with Replication" - but this one does not appear to be caused by it, as there are NO log messages like:

      AMQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=49ed198c-eb59-11e5-86fb-d3a98519ea5e

              Assignee: Martyn Taylor (Inactive) <mtaylor1@redhat.com>
              Reporter: Miroslav Novak <mnovak1@redhat.com>
