Set up two AMQ 7.7 brokers in a non-replicated, fully-connected mesh. To make the problem easier to reproduce, I set -Xmx for both brokers to 512MB – just so they run out of heap quickly.
Here is the test topology:
                   +-----------+     +------------+
publisher client-->| upstream  |---->| downstream |--> temporary subscriber
                   |  broker   |     |   broker   |
                   +-----------+     +------------+
I'm using the term "downstream" for the broker that the subscriber will connect to, and "upstream" for the broker to which the publisher will connect. Of course, this is a symmetric cluster, so the roles are interchangeable.
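To make the client side of this topology concrete, here is a minimal sketch of the connection details my test clients use, written against the Core JMS client (the AMQP equivalent is structurally the same). The host names, port, topic name, client ID, and subscription name are assumptions from my test setup, not values the brokers require.

import javax.jms.ConnectionFactory;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

// Hypothetical endpoints for the test topology. Hosts, port, topic name,
// client ID, and subscription name are assumptions, not anything mandated
// by the brokers.
class TestEndpoints {
    // The publisher connects to the "upstream" broker...
    static final ConnectionFactory UPSTREAM =
        new ActiveMQConnectionFactory("tcp://upstream-host:61616");

    // ...and the durable consumer connects to the "downstream" broker.
    static final ConnectionFactory DOWNSTREAM =
        new ActiveMQConnectionFactory("tcp://downstream-host:61616");

    static final String TOPIC = "test.topic";       // topic the durable subscription is on
    static final String CLIENT_ID = "sub-1";        // fixed client ID for the durable subscriber
    static final String SUB_NAME = "sub-1-durable"; // durable subscription name
}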
Use a JMS client to create a durable consumer on downstream, on a specific topic, with a specific client ID. (These client steps are sketched in code below.)
Use a JMS publisher to publish a message to the same topic, on upstream. Verify that the consumer receives the message.
Disconnect the consumer.
Use a JMS publisher to publish an unlimited number of, say, 50kB messages to the same topic. The larger the messages, the quicker the problem reproduces.
After a few hundred kB of messages, downstream fails with an OOM error and no stack backtrace. Thereafter, it is effectively dead.
Upstream starts logging messages saying it can't connect to downstream. This is expected.
Restart downstream – this requires a 'kill -9' in my tests.
Almost immediately, upstream fails with an OOM – again no backtrace. Upstream is now effectively dead, and has to be restarted.
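The steps above are easy to drive with a short JMS program. The sketch below, which reuses the hypothetical TestEndpoints class from earlier, creates the durable subscription on downstream, checks that one probe message published on upstream arrives, disconnects the subscriber, and then publishes 50kB messages in an endless loop on upstream. It reflects my test procedure rather than anything definitive.

import javax.jms.*;
import java.util.Arrays;

// Sketch of the reproduction steps; uses the hypothetical TestEndpoints above.
public class ReproduceOom {
    public static void main(String[] args) throws Exception {
        // Step 1: create a durable consumer on downstream with a fixed client ID.
        Connection subConn = TestEndpoints.DOWNSTREAM.createConnection();
        subConn.setClientID(TestEndpoints.CLIENT_ID);
        subConn.start();
        Session subSession = subConn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = subSession.createTopic(TestEndpoints.TOPIC);
        MessageConsumer durable =
            subSession.createDurableSubscriber(topic, TestEndpoints.SUB_NAME);

        // Step 2: publish one message on upstream and check the consumer gets it,
        // which confirms messages are being forwarded across the mesh.
        Connection pubConn = TestEndpoints.UPSTREAM.createConnection();
        pubConn.start();
        Session pubSession = pubConn.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer =
            pubSession.createProducer(pubSession.createTopic(TestEndpoints.TOPIC));
        producer.send(pubSession.createTextMessage("probe"));
        Message probe = durable.receive(10_000);
        System.out.println("Probe received: " + (probe != null));

        // Step 3: disconnect the durable consumer; the subscription itself survives.
        subConn.close();

        // Step 4: publish 50kB messages indefinitely; they accumulate for the
        // now-disconnected durable subscription.
        byte[] payload = new byte[50 * 1024];
        Arrays.fill(payload, (byte) 'x');
        long count = 0;
        while (true) {
            BytesMessage msg = pubSession.createBytesMessage();
            msg.writeBytes(payload);
            producer.send(msg);
            if (++count % 1000 == 0)
                System.out.println("Sent " + count + " messages");
        }
    }
}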
Note that paging does start on downstream – this can be seen in the log. A heap dump from downstream after failure shows the heap completely full of message-related objects – the exact class depends on the wire protocol in use.
I can reproduce this problem with AMQP and Core clients. I have observed other problems with durable consumers and AMQP, but this one does not seem to depend on protocol.