AMQ Broker
ENTMQBR-3437

AMQP consumption stalls during high message throughput


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: None
    • Fix Version/s: AMQ 7.4.4.CR1
    • Component/s: None
    • Target Release:
    • Release Notes Text:
      Previously, if a broker in a live-backup group configured to use the replication high-availability (HA) policy experienced a very high throughput of AMQP messages (that is, a rate of 10,000 or more messages per second), message consumption might stall. In addition, the broker and clients might display errors such as "Could not complete operations" (broker) and "Remote did not respond to a drain request in time" (client). This issue is now resolved.
    • Release Notes Docs Status:
      Documented as Resolved Issue
    • QE Test Coverage:
    • Upstream Jira:
      ARTEMIS-2642
    • Verified:
      Verified in a release

      Description

      This problem occurs when AMQP clients publish messages to, and consume messages from, the broker at rates in excess of 10,000 messages per second. It has been seen with Qpid JMS and the .NET AMQP client, but never with Artemis Core JMS or OpenWire clients. We therefore think that the problem is related to the broker's AMQP support.

      A variety of failure modes can be reproduced using a test case that publishes from twenty concurrent threads and consumes from twenty concurrent threads, with the consumers arranged into shared consumers in groups of five. It is not clear at this stage whether the subscription mode is significant. In addition, although the problem has so far been seen only in a master-slave, shared-nothing replication set-up, it is not clear whether the topology is relevant.
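      For reference, a master-slave, shared-nothing replication pair of the kind used here is typically configured with the replication HA policy in each broker's broker.xml. A minimal sketch (the group name "example-group" is illustrative; element names follow the standard Artemis schema):

      <!-- broker.xml on the live (master) broker -->
      <ha-policy>
        <replication>
          <master>
            <!-- pair this live broker with backups sharing the same group name -->
            <group-name>example-group</group-name>
            <check-for-live-server>true</check-for-live-server>
          </master>
        </replication>
      </ha-policy>

      <!-- broker.xml on the backup (slave) broker -->
      <ha-policy>
        <replication>
          <slave>
            <group-name>example-group</group-name>
            <allow-failback>true</allow-failback>
          </slave>
        </replication>
      </ha-policy>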

      All the failure modes have in common that message consumption stops, even though new messages are being published. There may, or may not, be exceptions in the broker or the client.

      In some cases we see this message in the broker:

      2020-03-10 00:15:33,115 WARN  [org.apache.activemq.artemis.core.server] AMQ222024: Could not complete operations on IO context OperationContextImpl [742496659] [minimalStore=0, storeLineUp=0, stored=0, minimalReplicated=1, replicationLineUp=1, replicated=0, paged=0, minimalPage=0, pageLineUp=0, errorCode=-1, errorMessage=null, executorsPending=0, executor=OrderedExecutor(tasks=[])]Task = TaskHolder [storeLined=0, replicationLined=1, pageLined=0, task=IOCallback(PageSubscriptionImpl) ]
      Task = TaskHolder [storeLined=0, replicationLined=1, pageLined=0, task=IOCallback(PageSubscriptionImpl) ]
      Task = TaskHolder [storeLined=0, replicationLined=1, pageLined=0, task=IOCallback(PageSubscriptionImpl) ]
      Task = TaskHolder [storeLined=0, replicationLined=1, pageLined=0, task=IOCallback(PageSubscriptionImpl) ]
      Task = TaskHolder [storeLined=0, replicationLined=1, pageLined=0, task=IOCallback(PageSubscriptionImpl) ]
      Task = TaskHolder [storeLined=0, replicationLined=1, pageLined=0, task=org.apache.activemq.artemis.core.journal.impl.SimpleWaitIOCallback]
      

      In some cases we see this exception in the client:

      Caused by: org.apache.qpid.jms.provider.exceptions.ProviderOperationTimedOutException: Remote did not respond to a drain request in time
              at org.apache.qpid.jms.provider.amqp.AmqpConsumer.lambda$stop$1(AmqpConsumer.java:179) ~[qpid-jms-client-0.48.0.jar!/:na]
              at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[netty-common-4.1.39.Final.jar!/:4.1.39.Final]
              at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:127) ~[netty-common-4.1.39.Final.jar!/:4.1.39.Final]
              at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[netty-common-4.1.39.Final.jar!/:4.1.39.Final]
              at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416) ~[netty-common-4.1.39.Final.jar!/:4.1.39.Final]
              at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:331) ~[netty-transport-native-epoll-4.1.39.Final-linux-x86_64.jar!/:4.1.39.Final]
              at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918) ~[netty-common-4.1.39.Final.jar!/:4.1.39.Final]
              at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.39.Final.jar!/:4.1.39.Final]
      

              People

              Assignee:
              clebert.suconic Clebert Suconic
              Reporter:
              kboone Kevin Boone
              Tester:
              Tiago Bueno Tiago Bueno
              Votes:
              0
              Watchers:
              2

                Dates

                Created:
                Updated:
                Resolved: