-
Bug
-
Resolution: Done
-
Major
-
AMQ 7.2.2.GA
-
None
-
-
+
-
Verified in a release
The broker went down in production under high load.
At 11:23 the master under high load started paging and soon after went unresponsive:
2019-07-10 11:23:25,338 WARN [org.apache.activemq.artemis.core.server] AMQ222038: Starting paging on address 'com_ig_trade_v0_legacy_order_history'; size is currently: 104,858,652 bytes; max-size-bytes: 104,857,600; global-size-bytes: 201,240,334 2019-07-10 11:57:45,862 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: syscall:read(..) failed: Connection reset by peer [code=GENERIC_EXCEPTION] 2019-07-10 12:16:42,962 WARN [org.apache.activemq.artemis.core.server] AMQ222109: Timed out waiting for write lock on consumer. Check the Thread dump 2019-07-10 12:16:44,527 WARN [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump: ...
Then, the slave became live (split-brain) and started paging and crashed as well:
2019-07-10 12:16:12,908 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: AMQ219015: The connection was disconnected because of server shutdown [code=DISCONNECTED] 2019-07-10 12:16:12,937 INFO [org.apache.activemq.artemis.core.server] AMQ221071: Failing over based on quorum vote results. 2019-07-10 12:16:16,599 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live 2019-07-10 12:16:20,116 WARN [org.apache.activemq.artemis.core.server] AMQ222038: Starting paging on address 'com_ig_trade_v0_close_out_monitor_account_state'; size is currently: 104,860,066 bytes; max-size-bytes: 104,857,600; global-size-bytes: 221,121,610 2019-07-10 12:17:07,971 WARN [org.apache.activemq.artemis.core.server] AMQ222109: Timed out waiting for write lock on consumer. Check the Thread dump 2019-07-10 12:17:11,759 WARN [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump: ...
Looking at the master's thread dump produced soon after paging started, there seems to be a deadlock between thread Id=685613 (WATING) thread Id=2497666 (BLOCKED) which in turns is blocking other 161 threads thus causing the failover. This deadlock may be caused by the lock on method PagingStoreImpl.isPaging [1] that has been removed in Artemis 2.7 (AMQ 7.3) among other paging improvements [2]. No reproducer available at the moment. Other improvements on the paging system are present the upcoming 7.4 release, so it could be an upgrade worth taking.
"Thread-18796 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@7dcf96dc)" Id=2497666 BLOCKED on org.apache.activemq.artemis.core.paging.cursor.impl.PageSubscriptionImpl$CursorIterator@4b31ac0e owned by "Thread-23 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5@3fb8a940)" Id=685613 at org.apache.activemq.artemis.core.paging.cursor.impl.PageSubscriptionImpl$CursorIterator.hasNext(PageSubscriptionImpl.java:1332) - blocked on org.apache.activemq.artemis.core.paging.cursor.impl.PageSubscriptionImpl$CursorIterator@4b31ac0e at org.apache.activemq.artemis.core.server.impl.QueueImpl.checkDepage(QueueImpl.java:2417) at org.apache.activemq.artemis.core.server.impl.QueueImpl.deliverAsync(QueueImpl.java:806) at org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.deliverIfNecessary(PageCursorProviderImpl.java:614) at org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl.cleanup(PageCursorProviderImpl.java:367) - locked org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl@58cba2dd at org.apache.activemq.artemis.core.paging.cursor.impl.PageCursorProviderImpl$1.run(PageCursorProviderImpl.java:290) at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66) at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$4/1107217291.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) Number of locked synchronizers = 2 - java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2fc0e8d9 - java.util.concurrent.ThreadPoolExecutor$Worker@52572922 "Thread-23 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5@3fb8a940)" Id=685613 WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2fc0e8d9 owned by "Thread-18796 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@7dcf96dc)" Id=2497666 at sun.misc.Unsafe.park(Native Method) - waiting on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2fc0e8d9 at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967) at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727) at org.apache.activemq.artemis.core.paging.impl.PagingStoreImpl.isPaging(PagingStoreImpl.java:278) at org.apache.activemq.artemis.core.paging.cursor.impl.PageSubscriptionImpl$CursorIterator.hasNext(PageSubscriptionImpl.java:1336) - locked org.apache.activemq.artemis.core.paging.cursor.impl.PageSubscriptionImpl$CursorIterator@4b31ac0e at org.apache.activemq.artemis.core.server.impl.QueueImpl.checkDepage(QueueImpl.java:2417) at org.apache.activemq.artemis.core.server.impl.QueueImpl.deliverAsync(QueueImpl.java:806) at org.apache.activemq.artemis.core.server.impl.ServerConsumerImpl.forceDelivery(ServerConsumerImpl.java:1093) at org.apache.activemq.artemis.core.server.impl.ServerConsumerImpl.promptDelivery(ServerConsumerImpl.java:1085) at org.apache.activemq.artemis.core.server.impl.ServerConsumerImpl.forceDelivery(ServerConsumerImpl.java:591) - locked org.apache.activemq.artemis.core.server.impl.ServerConsumerImpl@184f2bb9 at org.apache.activemq.artemis.core.server.impl.ServerConsumerImpl.forceDelivery(ServerConsumerImpl.java:578) at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.forceConsumerDelivery(ServerSessionImpl.java:858) at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.slowPacketHandler(ServerSessionPacketHandler.java:590) at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.onMessagePacket(ServerSessionPacketHandler.java:287) at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler$$Lambda$65/1826790006.onMessage(Unknown Source) at org.apache.activemq.artemis.utils.actors.Actor.doTask(Actor.java:33) at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66) at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$4/1107217291.run(Unknown Source) at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66) at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$4/1107217291.run(Unknown Source) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) Number of locked synchronizers = 1 - java.util.concurrent.ThreadPoolExecutor$Worker@3af247e0
[1] https://github.com/apache/activemq-artemis/blob/2.6.4/artemis-server/src/main/java/org/apache/activemq/artemis/core/paging/impl/PagingStoreImpl.java#L278
[2] https://issues.apache.org/jira/browse/ARTEMIS-2216