We have test which does 10 times failover/failback in shared store HA scenario to be 100% sure that HA is robust. The test at first send 50 000 messages. During sending there is no crash of server. After that there is sequence of 10 kills and restarts of server 1 which is doing during receiving of messages.
When journal is located on NFS I see a lot of following warnings during initial sending of messages. During sending the test doesn't perform any crash or restart of EAP. Producer is not aware of any error. All operations performed by producer are successful.
05:07:13,527 WARN [org.apache.activemq.artemis.journal] (Thread-13 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@24331ae0-1686156632)) AMQ142007: Can not find record 61,454 during compact replay 05:07:13,530 WARN [org.apache.activemq.artemis.journal] (Thread-13 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@24331ae0-1686156632)) AMQ142007: Can not find record 61,457 during compact replay 05:07:13,530 WARN [org.apache.activemq.artemis.journal] (Thread-13 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@24331ae0-1686156632)) AMQ142007: Can not find record 61,460 during compact replay 05:07:13,530 WARN [org.apache.activemq.artemis.journal] (Thread-13 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@24331ae0-1686156632)) AMQ142007: Can not find record 61,463 during compact replay 05:07:13,530 WARN [org.apache.activemq.artemis.journal] (Thread-13 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@24331ae0-1686156632)) AMQ142007: Can not find record 61,466 during compact replay 05:07:13,530 WARN [org.apache.activemq.artemis.journal] (Thread-13 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@24331ae0-1686156632)) AMQ142007: Can not find record 61,469 during compact replay
05:08:06,221 WARN [org.apache.activemq.artemis.core.server] (Thread-10 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@7a9014c8-370821279)) AMQ222055: Error on deleting duplic ate cache: java.lang.IllegalStateException: Cannot find add info 45135 on compactor or current records at org.apache.activemq.artemis.core.journal.impl.JournalImpl.checkKnownRecordID(JournalImpl.java:937) [artemis-journal-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1] at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:822) [artemis-journal-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1] at org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.deleteDuplicateID(AbstractJournalStorageManager.java:451) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redha t-1] at org.apache.activemq.artemis.core.postoffice.impl.DuplicateIDCacheImpl.addToCacheInMemory(DuplicateIDCacheImpl.java:263) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1] at org.apache.activemq.artemis.core.postoffice.impl.DuplicateIDCacheImpl.access$000(DuplicateIDCacheImpl.java:41) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1] at org.apache.activemq.artemis.core.postoffice.impl.DuplicateIDCacheImpl$AddDuplicateIDOperation.process(DuplicateIDCacheImpl.java:340) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1] at org.apache.activemq.artemis.core.postoffice.impl.DuplicateIDCacheImpl$AddDuplicateIDOperation.afterCommit(DuplicateIDCacheImpl.java:348) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1] at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.afterCommit(TransactionImpl.java:546) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1] at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl.access$100(TransactionImpl.java:38) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1] at org.apache.activemq.artemis.core.transaction.impl.TransactionImpl$3.done(TransactionImpl.java:321) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1] at org.apache.activemq.artemis.core.persistence.impl.journal.OperationContextImpl$1.run(OperationContextImpl.java:236) [artemis-server-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1] at org.apache.activemq.artemis.utils.OrderedExecutorFactory$OrderedExecutor$ExecutorTask.run(OrderedExecutorFactory.java:101) [artemis-commons-1.5.3.003-redhat-1.jar:1.5.3.003-redhat-1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_111] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_111] at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_111]
After producer finishes its job, receiver starts to receive messages. During receiving of messages server 1 is 10 times killed and restarted. Server 1 has shared store backup. When Artemis loads journal at server's start I see a lot of warnings listed below. I assume these warnings can be shown only if Artemis is killed during sending of messages.
At the end the test detects lost messages. I see that entire bulks of messages are lost. Producer commits each 100th message. When I tracked lost messages in server logs, I saw that they were successfully received by Artemis, paged and the transaction was committed.
05:12:06,359 WARN [org.apache.activemq.artemis.core.server] (Thread-17 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@504f2fba-1586027459)) AMQ222029: Could not locate page transaction 45,131, ignoring message on position PagePositionImpl [pageNr=2, messageNr=3119, recordID=-1] on address=jms.queue.testQueue queue=jms.queue.testQueue 05:12:06,361 WARN [org.apache.activemq.artemis.core.server] (Thread-17 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@504f2fba-1586027459)) AMQ222029: Could not locate page transaction 45,131, ignoring message on position PagePositionImpl [pageNr=2, messageNr=3120, recordID=-1] on address=jms.queue.testQueue queue=jms.queue.testQueue 05:12:06,361 WARN [org.apache.activemq.artemis.core.server] (Thread-17 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$3@504f2fba-1586027459)) AMQ222029: Could not locate page transaction 45,131, ignoring message on position PagePositionImpl [pageNr=2, messageNr=3121, recordID=-1] on address=jms.queue.testQueue queue=jms.queue.testQueue
This is regression against 7.0.z. The same test passes with 7.0.5. With EAP 7.0.5 I don't see neither mentioned warnings nor lost messages.
Customer impact: Paged messages can be lost if the journal is located on NFS disc. Since NFS disc is used with shared store HA, this blocks using of shared store HA on NFS.
- duplicates
-
JBEAP-9937 Lost messages in cluster with server crashes
- Closed
- is incorporated by
-
JBEAP-10454 Upgrade Artemis 1.5.4.jbossorg-003
- Closed
- relates to
-
ENTMQBR-585 AMQ7 replication failed
- Closed
-
JBEAP-9937 Lost messages in cluster with server crashes
- Closed
-
JBEAP-10475 IndexOutOfBoundsException during reading message if compacting is enabled
- Closed
-
ARTEMIS-1114 Loading...