- Bug
- Resolution: Not a Bug
- Critical
- None
- 7.1.0.CR3, 7.2.0.GA
- None
Scenario
- There are two Artemis servers in a cluster
- Server 1 has messages in a queue but no consumer (source server)
- Server 2 has a consumer (target server)
- Since the target server has a consumer, messages are redistributed from the source server to the target server
- The target server responds more slowly than the source server sends messages. This may be caused by slow I/O operations or by an exhausted CPU.
Expectation: All messages are redistributed from the source to the target server and consumed by the consumer.
Reality: The source server fails with an OutOfMemoryError.
Customer impact: If messages are being redistributed to a slow or exhausted target server, the source server may fail with an OutOfMemoryError. This is not a regression against previous EAP 6/7 versions, hence the lower priority.
Detailed description of the issue
ClusterConnectionBridge is responsible for the redistribution of messages. The producer used in the bridge implementation sends messages in a non-blocking way, which means that every sent packet is stored in ChannelImpl.resendCache, where it waits until it is confirmed by the target server.
private void addResendPacket(Packet packet) {
   resendCache.add(packet);

   if (logger.isTraceEnabled()) {
      logger.trace("ChannelImpl::addResendPacket adding packet " + packet + " stored commandID=" + firstStoredCommandID + " possible commandIDr=" + (firstStoredCommandID + resendCache.size()));
   }
}

private void clearUpTo(final int lastReceivedCommandID) {
   final int numberToClear = 1 + lastReceivedCommandID - firstStoredCommandID;

   if (logger.isTraceEnabled()) {
      logger.trace("ChannelImpl::clearUpTo lastReceived commandID=" + lastReceivedCommandID + " first commandID=" + firstStoredCommandID + " number to clear " + numberToClear);
   }

   for (int i = 0; i < numberToClear; i++) {
      final Packet packet = resendCache.poll();

      if (packet == null) {
         ActiveMQClientLogger.LOGGER.cannotFindPacketToClear(lastReceivedCommandID, firstStoredCommandID);
         firstStoredCommandID = lastReceivedCommandID + 1;
         return;
      }

      if (logger.isTraceEnabled()) {
         logger.trace("ChannelImpl::clearUpTo confirming " + packet + " towards " + commandConfirmationHandler);
      }

      if (commandConfirmationHandler != null) {
         commandConfirmationHandler.commandConfirmed(packet);
      }
   }

   firstStoredCommandID += numberToClear;
}
The resendCache has no upper limit, so it may cause an OOM on the server. As mentioned above, this can happen when the target server responds more slowly than the source server sends packets.
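To make the failure mode concrete, here is a simplified model (plain Java, not Artemis code) of what happens when confirmations arrive more slowly than sends: every send adds a packet to an unbounded cache, only a fraction of the packets get confirmed and removed, and the remainder stays referenced until the heap is exhausted.

import java.util.ArrayDeque;
import java.util.Queue;

// Simplified model of the problem, not Artemis code: every "send" adds a 1 KiB packet
// to an unbounded cache, but only every tenth send is "confirmed" and removed, so the
// cache keeps growing until the JVM eventually fails with an OutOfMemoryError.
public class UnboundedResendCacheDemo {
   public static void main(String[] args) {
      Queue<byte[]> resendCache = new ArrayDeque<>();
      long sent = 0;
      while (true) {
         resendCache.add(new byte[1024]); // packet kept until it is confirmed
         sent++;
         if (sent % 10 == 0) {            // the slow target confirms only 1 in 10 packets
            resendCache.poll();
         }
      }
   }
}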
Based on the documentation [1], I tried to limit the number of in-flight packets using the producer-window-size property, which can be configured on the cluster connection. Unfortunately, this property is not taken into account [2].
// No producer flow control on the bridges, as we don't want to lock the queues
targetLocator.setProducerWindowSize(-1);
I am not sure what is meant by the comment, but if there were several queues with a blocking address-full policy and one of them became full, it would block the rest of the queues, because the producer would be blocked on acquiring credits. IMO the logic must be changed if we want to control the flow on a cluster connection using the producer-window-size property.
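For comparison, a plain core client enables producer flow control through the same ServerLocator API that the bridge uses; below is a minimal sketch of that, where the broker URL and queue name are placeholders, not values from this report.

import org.apache.activemq.artemis.api.core.client.ActiveMQClient;
import org.apache.activemq.artemis.api.core.client.ClientMessage;
import org.apache.activemq.artemis.api.core.client.ClientProducer;
import org.apache.activemq.artemis.api.core.client.ClientSession;
import org.apache.activemq.artemis.api.core.client.ClientSessionFactory;
import org.apache.activemq.artemis.api.core.client.ServerLocator;

public class FlowControlledProducerExample {
   public static void main(String[] args) throws Exception {
      // Placeholder broker URL.
      ServerLocator locator = ActiveMQClient.createServerLocator("tcp://localhost:61616");
      // A positive window size makes the producer block once this many bytes are in
      // flight and not yet acknowledged by the broker (credit-based flow control);
      // -1, as used by the bridge above, disables it.
      locator.setProducerWindowSize(1024 * 1024);

      ClientSessionFactory factory = locator.createSessionFactory();
      ClientSession session = factory.createSession();
      ClientProducer producer = session.createProducer("exampleQueue"); // placeholder address
      ClientMessage message = session.createMessage(true);
      message.getBodyBuffer().writeString("hello");
      producer.send(message);

      session.close();
      factory.close();
      locator.close();
   }
}

This is how producer-window-size on the cluster connection would be expected to behave as well, if the bridge did not override it with setProducerWindowSize(-1).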
Another way to deal with this issue is to limit the size of the resendCache in ChannelImpl. If the cache has an upper limit, it cannot cause an OOM. However, I am not sure how difficult it would be to implement.
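The following is only a sketch of that idea, not actual ChannelImpl code: the number of unconfirmed packets is capped with a semaphore, so the sender blocks instead of letting the cache grow without bound. The class name, method names, and capacity handling are invented for illustration.

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Semaphore;

// Hypothetical bounded resend cache: the sender blocks once "capacity" unconfirmed
// packets are outstanding, so the cache cannot grow without limit.
final class BoundedResendCache<P> {
   private final Queue<P> cache = new ArrayDeque<>();
   private final Semaphore slots;

   BoundedResendCache(int capacity) {
      this.slots = new Semaphore(capacity);
   }

   // Called on send: waits if too many packets are already awaiting confirmation.
   void addResendPacket(P packet) throws InterruptedException {
      slots.acquire();
      synchronized (cache) {
         cache.add(packet);
      }
   }

   // Called on confirmation: drops the confirmed packets and frees slots for the sender.
   void clearUpTo(int numberToClear) {
      synchronized (cache) {
         for (int i = 0; i < numberToClear; i++) {
            cache.poll();
         }
      }
      slots.release(numberToClear);
   }
}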
[1] https://activemq.apache.org/artemis/docs/1.5.5/clusters.html#configuring-cluster-connections
[2] https://github.com/rh-messaging/jboss-activemq-artemis/blob/1.5.5.jbossorg-008/artemis-server/src/main/java/org/apache/activemq/artemis/core/server/cluster/impl/ClusterConnectionImpl.java#L784
- blocks
  - ENTMQBR-1176 Flow control improvements to the bridge - Closed
- is cloned by
  - ENTMQBR-2148 Provider-window-size causes queue to be not responsive - Closed
- is related to
  - JBEAP-13613 cluster-connection.producer-window-size property is ignored by Artemis - Closed