  1. JBoss Enterprise Application Platform
  2. JBEAP-13599

(7.4.z) ResendCache in ChannelImpl has no upper limit and may cause OOM error


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • None
    • 7.1.0.CR3, 7.2.0.GA
    • ActiveMQ
    • None
      git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
      cd eap-tests-hornetq/scripts/
      git checkout master
      groovy -DEAP_VERSION=7.1.0.CR3 PrepareServers7.groovy
      export WORKSPACE=$PWD
      export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
      export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
      export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
      export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
      
      cd ../jboss-hornetq-testsuite/
      
      mvn clean test -Dtest=ClusterTestRedistributionToExhaustedServerTestCase#testOOMWhenTargetServerRespondsSlowly -DfailIfNoTests=false -Deap=7x | tee log
      

      Scenario

      • There are two Artemis servers in a cluster.
      • Server 1 has messages in a queue but no consumer (source server).
      • Server 2 has a consumer (target server).
      • Since the target server has a consumer, messages are redistributed from the source server to the target server.
      • The target server responds more slowly than the source server sends messages. This may be caused by slow IO operations or by an exhausted CPU.

      Expectation: All messages are redistributed from the source server to the target server and consumed by the consumer.

      Reality: The source server fails with an OutOfMemoryError.

      Customer impact: If messages are redistributed to a slow or exhausted target server, the source server may fail with an OutOfMemoryError. This is not a regression against previous EAP 6/7 versions, hence the lower priority.

      Detailed description of the issue

      ClusterConnectionBridge is responsible for redistributing messages. The producer used in the bridge implementation sends messages in a non-blocking way. This means that every sent packet is stored in ChannelImpl.resendCache, where it waits until the target server confirms it.

      private void addResendPacket(Packet packet) {
         resendCache.add(packet);

         if (logger.isTraceEnabled()) {
            logger.trace("ChannelImpl::addResendPacket adding packet " + packet + " stored commandID=" + firstStoredCommandID + " possible commandIDr=" + (firstStoredCommandID + resendCache.size()));
         }
      }

      private void clearUpTo(final int lastReceivedCommandID) {
         final int numberToClear = 1 + lastReceivedCommandID - firstStoredCommandID;

         if (logger.isTraceEnabled()) {
            logger.trace("ChannelImpl::clearUpTo lastReceived commandID=" + lastReceivedCommandID +
                            " first commandID=" + firstStoredCommandID +
                            " number to clear " + numberToClear);
         }

         for (int i = 0; i < numberToClear; i++) {
            final Packet packet = resendCache.poll();

            if (packet == null) {
               ActiveMQClientLogger.LOGGER.cannotFindPacketToClear(lastReceivedCommandID, firstStoredCommandID);
               firstStoredCommandID = lastReceivedCommandID + 1;
               return;
            }

            if (logger.isTraceEnabled()) {
               logger.trace("ChannelImpl::clearUpTo confirming " + packet + " towards " + commandConfirmationHandler);
            }
            if (commandConfirmationHandler != null) {
               commandConfirmationHandler.commandConfirmed(packet);
            }
         }

         firstStoredCommandID += numberToClear;
      }
      

      The resendCache has no upper limit, so it can cause an OOM on the server. As mentioned above, this can happen when the target server responds more slowly than the source server sends packets.
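
      The growth described above can be illustrated with a small stand-alone model. This is only a sketch under assumptions: ResendCacheModel and simulate are hypothetical names, packets are plain byte arrays, and the per-tick send and confirm rates are invented for illustration. It mirrors the add/clear logic quoted above but is not the Artemis class.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

/** Simplified, hypothetical model of an unbounded resend cache (not the Artemis ChannelImpl). */
public class ResendCacheModel {
   private final Queue<byte[]> resendCache = new ConcurrentLinkedQueue<>();
   private int firstStoredCommandID = 0;

   /** Every sent packet is kept until it is confirmed; nothing limits the queue size. */
   void addResendPacket(byte[] packet) {
      resendCache.add(packet);
   }

   /** Drops all packets up to and including lastReceivedCommandID. */
   void clearUpTo(int lastReceivedCommandID) {
      int numberToClear = 1 + lastReceivedCommandID - firstStoredCommandID;
      for (int i = 0; i < numberToClear; i++) {
         if (resendCache.poll() == null) {
            firstStoredCommandID = lastReceivedCommandID + 1;
            return;
         }
      }
      firstStoredCommandID += numberToClear;
   }

   int size() {
      return resendCache.size();
   }

   /** Sends sendPerTick packets and confirms at most confirmPerTick per tick; returns the final cache size. */
   static int simulate(int ticks, int sendPerTick, int confirmPerTick) {
      ResendCacheModel cache = new ResendCacheModel();
      int sent = 0;
      int confirmed = 0;
      for (int t = 0; t < ticks; t++) {
         for (int i = 0; i < sendPerTick; i++) {
            cache.addResendPacket(new byte[16]);
            sent++;
         }
         int newlyConfirmed = Math.min(confirmPerTick, sent - confirmed);
         if (newlyConfirmed > 0) {
            confirmed += newlyConfirmed;
            cache.clearUpTo(confirmed - 1); // command IDs start at 0
         }
      }
      return cache.size();
   }

   public static void main(String[] args) {
      System.out.println(simulate(1000, 10, 3));  // slow target: prints 7000
      System.out.println(simulate(1000, 10, 10)); // target keeps up: prints 0
   }
}
```

      In this model the backlog grows by the difference between the send rate and the confirmation rate on every tick, so with a persistently slow target the heap usage is limited only by the number of messages waiting in the source queue.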

      Based on the documentation [1], I tried to limit the number of in-flight packets using producer-window-size, which can be configured on the cluster connection. Unfortunately, this property is not taken into account [2].

      // No producer flow control on the bridges, as we don't want to lock the queues
      targetLocator.setProducerWindowSize(-1);
      

      I am not sure what the comment means, but if there were several queues with the block policy and one of them were full, it would block the rest of the queues, because the producer would be blocked while acquiring credits. IMO the logic must be changed if we want to control the flow on the cluster connection using the producer-window-size property.
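
      The blocking concern can be sketched with a toy model of a single producer that forwards messages for several destinations in order. Everything here is an assumption made for illustration: the class and method names are hypothetical, credits are counted per message instead of per byte, and "blocking" is modelled as simply stopping the loop.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model: one producer, per-destination credits, strict in-order forwarding. */
public class HeadOfLineBlockingSketch {

   /** Returns how many messages are forwarded before the producer would block on missing credits. */
   static int forwardUntilBlocked(List<String> destinations, Map<String, Integer> credits) {
      Map<String, Integer> remaining = new HashMap<>(credits);
      int forwarded = 0;
      for (String destination : destinations) {
         int available = remaining.getOrDefault(destination, 0);
         if (available == 0) {
            // A blocking producer would wait here, so later messages for other
            // destinations are stuck behind this one.
            return forwarded;
         }
         remaining.put(destination, available - 1);
         forwarded++;
      }
      return forwarded;
   }

   public static void main(String[] args) {
      Map<String, Integer> credits = new HashMap<>();
      credits.put("queueA", 0);  // full queue, no credits left
      credits.put("queueB", 10); // plenty of credits
      List<String> order = Arrays.asList("queueB", "queueA", "queueB", "queueB");
      System.out.println(forwardUntilBlocked(order, credits)); // prints 1
   }
}
```

      Only the first message is forwarded even though queueB could accept all three of its messages, which is the situation the paragraph above describes.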

      Another way to deal with this issue is to limit the size of the resendCache in ChannelImpl. If the cache had an upper limit, it could not cause an OOM. However, I am not sure how difficult that would be to implement.
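
      One possible shape of such a limit, as a minimal sketch only: a semaphore holds one permit per free slot, adding a packet takes a permit, and confirming packets returns permits. BoundedResendCache, tryAdd and confirm are hypothetical names and do not exist in Artemis; how a real ChannelImpl should react when the cache is full (block, back off, fail the send) is left open here.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Semaphore;

/** Hypothetical bounded resend cache; a sketch, not part of Artemis. */
public class BoundedResendCache<P> {
   private final Queue<P> packets = new ConcurrentLinkedQueue<>();
   private final Semaphore freeSlots;

   public BoundedResendCache(int maxPackets) {
      this.freeSlots = new Semaphore(maxPackets);
   }

   /** Stores the packet if a slot is free; returns false when the limit is reached. */
   public boolean tryAdd(P packet) {
      if (!freeSlots.tryAcquire()) {
         return false; // caller has to wait for confirmations before sending more
      }
      packets.add(packet);
      return true;
   }

   /** Removes up to count confirmed packets, frees their slots and returns how many were removed. */
   public int confirm(int count) {
      int removed = 0;
      while (removed < count && packets.poll() != null) {
         removed++;
      }
      freeSlots.release(removed);
      return removed;
   }

   public int size() {
      return packets.size();
   }

   public static void main(String[] args) {
      BoundedResendCache<String> cache = new BoundedResendCache<>(2);
      System.out.println(cache.tryAdd("p1")); // true
      System.out.println(cache.tryAdd("p2")); // true
      System.out.println(cache.tryAdd("p3")); // false, limit reached
      cache.confirm(1);
      System.out.println(cache.tryAdd("p3")); // true, one slot was freed
   }
}
```

      With a fixed number of slots the memory held by unconfirmed packets is bounded by the slot count times the largest packet size, at the cost of pushing back on the sender when the target is slow.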

      [1] https://activemq.apache.org/artemis/docs/1.5.5/clusters.html#configuring-cluster-connections
      [2] https://github.com/rh-messaging/jboss-activemq-artemis/blob/1.5.5.jbossorg-008/artemis-server/src/main/java/org/apache/activemq/artemis/core/server/cluster/impl/ClusterConnectionImpl.java#L784

              istudens@redhat.com Ivo Studensky
              eduda_jira Erich Duda (Inactive)