  1. JBoss Enterprise Application Platform
  2. JBEAP-12681

Initial replication may fail because of incorrect packet ordering


    • Bug
    • Resolution: Done
    • Blocker
    • 7.1.0.CR1
    • 7.1.0.ER3
    • ActiveMQ
    • None
    • Regression, Blocks Testing
      git clone https://github.com/rh-messaging/jboss-activemq-artemis
      cd jboss-activemq-artemis
      git checkout 1.5.5.jbossorg-006
      mvn install -Ptests -Dtest=ReplicatedFailoverTest#testTimeoutOnFailover -Drat.ignoreErrors=true -DfailIfNoTests=false | tee log
      

      Scenario: The issue occurs in all replication scenarios during initial synchronization.
      Customer impact: Initial replication between the live and backup servers may fail, in which case replication will not work.

      We see this issue only in the Artemis upstream test suite. We haven't seen it in EAP tests.
      Although the EAP failover tests didn't hit this issue, there is still a risk that it may arise in production, so the priority was set to Blocker.

      This is a regression against 7.0.z.

      Detailed description of the issue
      The following NullPointerException arises in almost all replication tests in the upstream Artemis test suite.

      *** [Thread-1 (org.apache.activemq.artemis.utils.ActiveMQThreadFactory)] ***
      08:11:01,702 WARN  [org.apache.activemq.artemis.core.replication.ReplicationEndpoint] null: java.lang.NullPointerException
      	at org.apache.activemq.artemis.core.replication.ReplicationEndpoint.handleReplicationSynchronization(ReplicationEndpoint.java:444) [artemis-server-1.5.5.jbossorg-006.jar:1.5.5.jbossorg-006]
      	at org.apache.activemq.artemis.core.replication.ReplicationEndpoint.handlePacket(ReplicationEndpoint.java:196) [artemis-server-1.5.5.jbossorg-006.jar:1.5.5.jbossorg-006]
      	at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.handlePacket(ChannelImpl.java:633) [artemis-core-client-1.5.5.jbossorg-006.jar:1.5.5.jbossorg-006]
      	at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.doBufferReceived(RemotingConnectionImpl.java:379) [artemis-core-client-1.5.5.jbossorg-006.jar:1.5.5.jbossorg-006]
      	at org.apache.activemq.artemis.core.protocol.core.impl.RemotingConnectionImpl.bufferReceived(RemotingConnectionImpl.java:362) [artemis-core-client-1.5.5.jbossorg-006.jar:1.5.5.jbossorg-006]
      	at org.apache.activemq.artemis.core.client.impl.ClientSessionFactoryImpl$DelegatingBufferHandler.bufferReceived(ClientSessionFactoryImpl.java:1143) [artemis-core-client-1.5.5.jbossorg-006.jar:1.5.5.jbossorg-006]
      	at org.apache.activemq.artemis.core.remoting.impl.invm.InVMConnection$1.run(InVMConnection.java:196) [artemis-server-1.5.5.jbossorg-006.jar:1.5.5.jbossorg-006]
      	at org.apache.activemq.artemis.utils.OrderedExecutorFactory$OrderedExecutor$ExecutorTask.run(OrderedExecutorFactory.java:118) [artemis-commons-1.5.5.jbossorg-006.jar:1.5.5.jbossorg-006]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1153) [rt.jar:1.8.0]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [rt.jar:1.8.0]
      	at java.lang.Thread.run(Thread.java:785) [vm.jar:2.6 (05-16-2017)]
      

      I found out that the issue is caused by incorrect ordering of replication packets. The NPE arises when ReplicationSyncFileMessage packets are sent before ReplicationStartSyncMessage packets.

      The incorrect ordering can happen because of the useExecutor parameter of the sendReplicatePacket method. ReplicationStartSyncMessage packets are sent first, but with useExecutor=true. ReplicationSyncFileMessage packets are sent afterwards, but with useExecutor=false. As a result, sending a ReplicationStartSyncMessage packet is only scheduled on the executor, with no guarantee of when the task will run, whereas ReplicationSyncFileMessage packets are sent immediately.

      private OperationContext sendReplicatePacket(final Packet packet, boolean lineUp, boolean useExecutor) {
         if (!enabled)
            return null;
         boolean runItNow = false;

         final OperationContext repliToken = OperationContextImpl.getContext(executorFactory);
         if (lineUp) {
            repliToken.replicationLineUp();
         }

         if (enabled) {
            if (useExecutor) {
               replicationStream.execute(() -> {
                  if (enabled) {
                     pendingTokens.add(repliToken);
                     flowControl(packet.expectedEncodeSize());
                     replicatingChannel.send(packet);
                  }
               });
            } else {
               pendingTokens.add(repliToken);
               flowControl(packet.expectedEncodeSize());
               replicatingChannel.send(packet);
            }
         } else {
            // Already replicating channel failed, so just play the action now
            runItNow = true;
         }

         // Execute outside lock

         if (runItNow) {
            repliToken.replicationDone();
         }

         return repliToken;
      }
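
The reordering described above can be reproduced in isolation with a small, self-contained sketch. It does not use any Artemis classes: the class name ReplicationOrderingSketch, the method deliveryOrder and the received list are hypothetical stand-ins, and the packet names are plain strings. A latch keeps the single-threaded executor busy so that the outcome is deterministic, which is an assumption made purely for illustration. In the real server the delay depends on scheduling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ReplicationOrderingSketch {

    // Returns the order in which the two (simulated) packets reach the channel.
    // fileMessageUsesExecutor mimics the useExecutor flag for the file packet;
    // the start-sync packet always goes through the executor, as in the report.
    static List<String> deliveryOrder(boolean fileMessageUsesExecutor) throws InterruptedException {
        // Hypothetical stand-in for replicatingChannel.send(packet)
        List<String> received = Collections.synchronizedList(new ArrayList<>());
        ExecutorService replicationStream = Executors.newSingleThreadExecutor();
        CountDownLatch executorBusy = new CountDownLatch(1);
        try {
            // Assumption for illustration: the executor is occupied by earlier work.
            replicationStream.execute(() -> {
                try {
                    executorBusy.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // Start-sync packet: only scheduled, not yet sent (useExecutor=true).
            replicationStream.execute(() -> received.add("ReplicationStartSyncMessage"));

            // File packet: sent directly on the caller thread, or queued behind the start packet.
            if (fileMessageUsesExecutor) {
                replicationStream.execute(() -> received.add("ReplicationSyncFileMessage"));
            } else {
                received.add("ReplicationSyncFileMessage");
            }
        } finally {
            executorBusy.countDown();      // let the executor drain its queue
            replicationStream.shutdown();
            replicationStream.awaitTermination(5, TimeUnit.SECONDS);
        }
        return received;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("mixed paths:   " + deliveryOrder(false));
        System.out.println("executor only: " + deliveryOrder(true));
    }
}
```

With mixed paths the file packet overtakes the start-sync packet. When both packets go through the same single-threaded executor, submission order is preserved in this toy model. This only illustrates the race and is not necessarily how the issue was resolved in Artemis.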
      

            rhn-support-jbertram Justin Bertram
            eduda_jira Erich Duda (Inactive)
