Type: Bug
Resolution: Not a Bug
Priority: Critical
Fix Version/s: None
Affects Version/s: 7.1.0.CR3, 7.2.0.GA
Component/s: ActiveMQ
Labels:
None

CDW devel_ack:
CDW docs_ack:
CDW pm_ack:
CDW qa_ack:
CDW release:
Target Release:

7.4.z.GA
Git Pull Request:
https://github.com/rh-messaging/jboss-activemq-artemis/pull/247
Steps to Reproduce:
git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
cd eap-tests-hornetq/scripts/
git checkout master
groovy -DEAP_VERSION=7.1.0.CR3 PrepareServers7.groovy
export WORKSPACE=$PWD
export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
cd ../jboss-hornetq-testsuite/
mvn clean test -Dtest=ClusterTestRedistributionToExhaustedServerTestCase#testOOMWhenTargetServerRespondsSlowly -DfailIfNoTests=false -Deap=7x | tee log

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Scenario

There are two Artemis servers in a cluster
Server 1 has messages in a queue but no consumer (source server)
Server 2 has a consumer (target server)
Since the target server has a consumer, messages are redistributed from the source server to the target server
The target server responds more slowly than the source server sends messages. This may be caused by slow IO operations or by an exhausted CPU.

Expectation: All messages are redistributed from the source server to the target server and consumed by the consumer.

Reality: The source server fails with an OutOfMemory error.

Customer impact: If messages are being redistributed to a slow or exhausted target server, the source server may fail with an OutOfMemory error. This is not a regression against previous EAP 6/7 versions, thus setting a lower priority.

Detailed description of the issue

ClusterConnectionBridge is responsible for redistributing messages. The producer used in the bridge implementation sends messages in a non-blocking way. This means that every sent packet is stored in ChannelImpl.resendCache, where it waits until the target server confirms it.

private void addResendPacket(Packet packet) {
   resendCache.add(packet);

   if (logger.isTraceEnabled()) {
      logger.trace("ChannelImpl::addResendPacket adding packet " + packet + " stored commandID=" + firstStoredCommandID + " possible commandIDr=" + (firstStoredCommandID + resendCache.size()));
   }
}

private void clearUpTo(final int lastReceivedCommandID) {
   final int numberToClear = 1 + lastReceivedCommandID - firstStoredCommandID;

   if (logger.isTraceEnabled()) {
      logger.trace("ChannelImpl::clearUpTo lastReceived commandID=" + lastReceivedCommandID +
                      " first commandID=" + firstStoredCommandID +
                      " number to clear " + numberToClear);
   }

   for (int i = 0; i < numberToClear; i++) {
      final Packet packet = resendCache.poll();

      if (packet == null) {
         ActiveMQClientLogger.LOGGER.cannotFindPacketToClear(lastReceivedCommandID, firstStoredCommandID);
         firstStoredCommandID = lastReceivedCommandID + 1;
         return;
      }

      if (logger.isTraceEnabled()) {
         logger.trace("ChannelImpl::clearUpTo confirming " + packet + " towards " + commandConfirmationHandler);
      }
      if (commandConfirmationHandler != null) {
         commandConfirmationHandler.commandConfirmed(packet);
      }
   }

   firstStoredCommandID += numberToClear;
}

The resendCache has no upper limit and it may cause OOM on the server. As I mentioned above, this may happen when the target server responds slower than the source server sends packets.
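To make the growth concrete, here is a minimal sketch of the same bookkeeping under a fixed rate mismatch. It is a toy model, not the Artemis ChannelImpl: the class ResendCacheModel, the simulate method and the "tick" abstraction are all made up for illustration, packets are represented by their command IDs only, and the model assumes the target never confirms more than was sent.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of an unbounded resend cache (hypothetical, not Artemis code). */
final class ResendCacheModel {
    private final Queue<Integer> resendCache = new ArrayDeque<>();
    private int firstStoredCommandID = 0;
    private int nextCommandID = 0;

    /** Non-blocking send: the packet stays cached until it is confirmed. */
    int send() {
        int id = nextCommandID++;
        resendCache.add(id);
        return id;
    }

    /** Confirmation from the target: drop every packet up to lastReceivedCommandID. */
    void clearUpTo(int lastReceivedCommandID) {
        int numberToClear = 1 + lastReceivedCommandID - firstStoredCommandID;
        for (int i = 0; i < numberToClear; i++) {
            if (resendCache.poll() == null) {
                firstStoredCommandID = lastReceivedCommandID + 1;
                return;
            }
        }
        firstStoredCommandID += numberToClear;
    }

    int size() {
        return resendCache.size();
    }

    /**
     * Runs the given number of ticks. In each tick the source sends sendPerTick
     * packets and the target confirms confirmPerTick of them
     * (assumes confirmPerTick <= sendPerTick). Returns the final cache size.
     */
    static int simulate(int ticks, int sendPerTick, int confirmPerTick) {
        ResendCacheModel channel = new ResendCacheModel();
        int confirmed = -1; // highest confirmed command ID so far
        for (int t = 0; t < ticks; t++) {
            for (int i = 0; i < sendPerTick; i++) {
                channel.send();
            }
            confirmed += confirmPerTick;
            channel.clearUpTo(confirmed);
        }
        return channel.size();
    }
}
```

In this model, simulate(1000, 4, 1) leaves 3000 packets in the cache, and the number keeps growing linearly with the number of ticks because nothing ever pushes back on the sender.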

Based on the documentation [1], I tried to limit the number of in-flight packets using producer-window-size, which can be configured on the cluster connection. Unfortunately, this property is not taken into account [2]:

// No producer flow control on the bridges, as we don't want to lock the queues
targetLocator.setProducerWindowSize(-1);

I am not sure what the comment means, but if there were several queues with the block policy and one of them were full, it would block the rest of the queues, because the producer would be blocked while acquiring credits. IMO the logic must be changed if we want to control the flow on the cluster connection using the producer-window-size property.
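The blocking concern can be illustrated with a small model. This is only a sketch of my reading of the problem, not how Artemis implements credits: SharedProducerModel, grant and forward are hypothetical names, credits are counted per message rather than in bytes, and instead of really blocking, the loop just stops where a blocking producer would wait.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of per-address producer credits on one shared bridge producer (hypothetical). */
final class SharedProducerModel {
    private final Map<String, Integer> credits = new HashMap<>();

    /** The target grants the given number of credits for an address. */
    void grant(String address, int amount) {
        credits.merge(address, amount, Integer::sum);
    }

    /**
     * Forwards messages in send order, spending one credit per message.
     * Stops at the first message whose address has no credits left,
     * and returns how many messages were forwarded before that point.
     */
    int forward(List<String> addressesInSendOrder) {
        int forwarded = 0;
        for (String address : addressesInSendOrder) {
            int available = credits.getOrDefault(address, 0);
            if (available == 0) {
                break; // a real blocking producer would wait here for credits
            }
            credits.put(address, available - 1);
            forwarded++;
        }
        return forwarded;
    }
}
```

If queueB has ten credits and queueA has none, forwarding the sequence queueB, queueA, queueB, queueB moves only the first message. The remaining queueB messages are stuck behind the queueA message even though queueB still has credits.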

Another way to deal with this issue is to limit the size of the resendCache in ChannelImpl. If the cache has an upper limit, it cannot cause OOM. However, I am not sure how difficult that would be to implement.
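A rough sketch of what such a limit could look like is below. It is not a patch against ChannelImpl: BoundedResendCache, tryAdd and confirm are hypothetical names, and the sketch ignores command IDs, reconnects and resending. It only shows the idea of refusing new packets once a fixed number are unconfirmed.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;

/** Hypothetical bounded resend cache; the names here are not Artemis API. */
final class BoundedResendCache<P> {
    private final Queue<P> packets = new ConcurrentLinkedQueue<>();
    private final Semaphore freeSlots;

    BoundedResendCache(int maxPackets) {
        this.freeSlots = new Semaphore(maxPackets);
    }

    /** Caches the packet if a slot is free; returns false when the cache is full. */
    boolean tryAdd(P packet) {
        if (!freeSlots.tryAcquire()) {
            return false;
        }
        packets.add(packet);
        return true;
    }

    /** Drops up to count confirmed packets and frees their slots; returns the number dropped. */
    int confirm(int count) {
        int cleared = 0;
        while (cleared < count && packets.poll() != null) {
            cleared++;
        }
        freeSlots.release(cleared);
        return cleared;
    }

    int size() {
        return packets.size();
    }
}
```

The sketch returns false instead of blocking, so the caller has to decide what to do when the cache is full (wait, retry later, or pause delivery from the queue). That choice is the hard part, since blocking the sending thread could run into the same "lock the queues" concern as producer flow control.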

[1] https://activemq.apache.org/artemis/docs/1.5.5/clusters.html#configuring-cluster-connections
[2] https://github.com/rh-messaging/jboss-activemq-artemis/blob/1.5.5.jbossorg-008/artemis-server/src/main/java/org/apache/activemq/artemis/core/server/cluster/impl/ClusterConnectionImpl.java#L784

producer-window-size-50000-01.patch
4 kB
2018/11/20 4:22 AM

blocks

ENTMQBR-1176 FLow control improvements to the bridge

Closed

is cloned by

ENTMQBR-2148 Provider-window-size causes queue to be not responsive

Closed

is related to

JBEAP-13613 cluster-connection.producer-window-size property is ignored by Artemis

Closed
