Uploaded image for project: 'JBoss Enterprise Application Platform'
  1. JBoss Enterprise Application Platform
  2. JBEAP-10062

Lost paged messages if journal is located on NFS

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Duplicate
    • Icon: Blocker Blocker
    • None
    • 7.1.0.DR15
    • ActiveMQ
    • None
    • Regression
    • Hide
      git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
      cd eap-tests-hornetq/scripts/
      git checkout b17d5f547b72b0370248f9f13f87147db1ed9d87
      groovy -DEAP_VERSION=7.1.0.DR15 PrepareServers7.groovy
      export WORKSPACE=$PWD
      export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
      export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
      export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
      export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
      
      cd ../jboss-hornetq-testsuite/
      
      mvn clean test -Dtest=DedicatedFailoverTestCase#testMultipleFailoverReceiver -DfailIfNoTests=false -Deap=7x | tee log
      
      Show
      git clone git: //git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git cd eap-tests-hornetq/scripts/ git checkout b17d5f547b72b0370248f9f13f87147db1ed9d87 groovy -DEAP_VERSION=7.1.0.DR15 PrepareServers7.groovy export WORKSPACE=$PWD export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap cd ../jboss-hornetq-testsuite/ mvn clean test -Dtest=DedicatedFailoverTestCase#testMultipleFailoverReceiver -DfailIfNoTests= false -Deap=7x | tee log

      We have test which does 10 times failover/failback in shared store HA scenario to be 100% sure that HA is robust. The test at first send 50 000 messages. During sending there is no crash of server. After that there is sequence of 10 kills and restarts of server 1 which is doing during receiving of messages.

      When journal is located on NFS I see a lot of following warnings during initial sending of messages. During sending the test doesn't perform any crash or restart of EAP. Producer is not aware of any error. All operations performed by producer are successful.

      05:07:13,527 WARN  [org.apache.activemq.artemis.journal] (Thread-13 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@24331ae0-1686156632)) AMQ142007: Can not find record 61,454 during compact replay
      05:07:13,530 WARN  [org.apache.activemq.artemis.journal] (Thread-13 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@24331ae0-1686156632)) AMQ142007: Can not find record 61,457 during compact replay
      05:07:13,530 WARN  [org.apache.activemq.artemis.journal] (Thread-13 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@24331ae0-1686156632)) AMQ142007: Can not find record 61,460 during compact replay
      05:07:13,530 WARN  [org.apache.activemq.artemis.journal] (Thread-13 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@24331ae0-1686156632)) AMQ142007: Can not find record 61,463 during compact replay
      05:07:13,530 WARN  [org.apache.activemq.artemis.journal] (Thread-13 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@24331ae0-1686156632)) AMQ142007: Can not find record 61,466 during compact replay
      05:07:13,530 WARN  [org.apache.activemq.artemis.journal] (Thread-13 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@24331ae0-1686156632)) AMQ142007: Can not find record 61,469 during compact replay
      
      05:08:06,221 WARN  [org.apache.activemq.artemis.core.server] (Thread-10 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@7a9014c8-370821279)) AMQ222055: Error on deleting duplic
      ate cache: java.lang.IllegalStateException: Cannot find add info 45135 on compactor or current records
              at org.apache.activemq.artemis.core.journal.impl.JournalImpl.checkKnownRecordID(JournalImpl.java:937) [artemis-journal-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1]
              at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:822) [artemis-journal-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1]
              at org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.deleteDuplicateID(AbstractJournalStorageManager.java:451) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redha
      t-1]
              at org.apache.activemq.artemis.core.postoffice.impl.DuplicateIDCacheImpl.addToCacheInMemory(DuplicateIDCacheImpl.java:263) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1]
              at org.apache.activemq.artemis.core.postoffice.impl.DuplicateIDCacheImpl.access$000(DuplicateIDCacheImpl.java:41) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1]
              at org.apache.activemq.artemis.core.postoffice.impl.DuplicateIDCacheImpl$AddDuplicateIDOperation.process(DuplicateIDCacheImpl.java:340) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1]
              at org.apache.activemq.artemis.core.postoffice.impl.DuplicateIDCacheImpl$AddDuplicateIDOperation.afterCommit(DuplicateIDCacheImpl.java:348) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1]
              at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.afterCommit(TransactionImpl.java:546) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1]
              at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.access$100(TransactionImpl.java:38) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1]
              at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl$3.done(TransactionImpl.java:321) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1]
              at org.apache.activemq.artemis.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:236) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1]
              at org.apache.activemq.artemis.utils.OrderedExecutorFactory$OrderedExecutor$ExecutorTask.run(OrderedExecutorFactory.java:101) [artemis-commons-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_111]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_111]
              at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_111]
      

      After producer finishes its job, receiver starts to receive messages. During receiving of messages server 1 is 10 times killed and restarted. Server 1 has shared store backup. When Artemis loads journal at server's start I see a lot of warnings listed below. I assume these warnings can be shown only if Artemis is killed during sending of messages.

      At the end the test detects lost messages. I see that entire bulks of messages are lost. Producer commits each 100th message. When I tracked lost messages in server logs, I saw that they were successfully received by Artemis, paged and the transaction was committed.

      05:12:06,359 WARN  [org.apache.activemq.artemis.core.server] (Thread-17 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@504f2fba-1586027459)) AMQ222029: Could not locate page transaction 45,131, ignoring message on position PagePositionImpl [pageNr=2, messageNr=3119, recordID=-1] on address=jms.queue.testQueue queue=jms.queue.testQueue
      05:12:06,361 WARN  [org.apache.activemq.artemis.core.server] (Thread-17 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@504f2fba-1586027459)) AMQ222029: Could not locate page transaction 45,131, ignoring message on position PagePositionImpl [pageNr=2, messageNr=3120, recordID=-1] on address=jms.queue.testQueue queue=jms.queue.testQueue
      05:12:06,361 WARN  [org.apache.activemq.artemis.core.server] (Thread-17 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@504f2fba-1586027459)) AMQ222029: Could not locate page transaction 45,131, ignoring message on position PagePositionImpl [pageNr=2, messageNr=3121, recordID=-1] on address=jms.queue.testQueue queue=jms.queue.testQueue
      

      This is regression against 7.0.z. The same test passes with 7.0.5. With EAP 7.0.5 I don't see neither mentioned warnings nor lost messages.

      Customer impact: Paged messages can be lost if the journal is located on NFS disc. Since NFS disc is used with shared store HA, this blocks using of shared store HA on NFS.

              rh-ee-ataylor Andy Taylor
              eduda_jira Erich Duda (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

                Created:
                Updated:
                Resolved: