I was able to reproduce this in a local environment by setting up a relatively vanilla 2-node broker cluster (clustering may not be strictly necessary). Memory was set to 4G min/max on both brokers. To see the issue in the logs, enable TRACE logging for the MQTT protocol package:
logger.org.apache.activemq.artemis.core.protocol.mqtt.level=TRACE
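On brokers that use the JBoss LogManager configuration (etc/logging.properties), the category typically also has to be appended to the existing loggers= list for the level setting to take effect; a minimal sketch, with the other entries in that list abbreviated:

# etc/logging.properties on each broker
loggers=<existing entries>,org.apache.activemq.artemis.core.protocol.mqtt
logger.org.apache.activemq.artemis.core.protocol.mqtt.level=TRACE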
To reproduce, set up the broker cluster, then extract the attached consumer and producer applications locally.
On the node1 broker (or even both), try modifying the network to introduce some packet delay and loss (as root):
[root@node1 ~]# tc qdisc add dev eth0 root netem delay 600ms 200ms loss 5% 25% distribution normal
This may not be strictly necessary as I was able to reproduce the issue once or twice without it, but reproduction was more consistent with it. If possible, use a multi-homed host for this, so the broker can be configured to use the modified interface, while ssh / scp can use the other interface.
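To confirm the netem rule is in place, and to remove it once the test is done (assuming the same eth0 interface as above):

[root@node1 ~]# tc qdisc show dev eth0
[root@node1 ~]# tc qdisc del dev eth0 root netem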
1. Modify the consumer's jndi.properties so the remote.address property points to the node2 broker in the cluster, then start the consumer.
2. Modify the producer's jndi.properties so the remote.address property points to the node1 broker in the cluster, then start the producer (an illustrative remote.address sketch appears after these steps).
3. Wait until the producer seems to stop producing messages. If the producer exits cleanly, it probably means the issue didn't reproduce.
4. Using grep against the artemis.log of the broker the producer is connected to (node1), compare the count of inbound PUBLISH events against the count of outbound PUBREC events:
grep PUBLISH ../log/artemis.log | grep ' IN <<' | wc -l
grep PUBREC ../log/artemis.log | grep ' OUT >>' | wc -l
If these counts are unequal, the broker accepted a QoS 2 PUBLISH but never sent the matching PUBREC, and there should be a hung thread in the producer application waiting on that acknowledgement. If the broker is restarted, the producer should resume producing and finish the batch.
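For reference on steps 1 and 2, the jndi.properties change is just a matter of pointing remote.address at the desired broker. The lines below are purely illustrative; the hostnames, port, and exact value format are assumptions, so use whatever form ships with the attached applications:

# consumer jndi.properties - point at the node2 broker (illustrative values)
remote.address=node2.example.com:61616
# producer jndi.properties - point at the node1 broker (illustrative values)
remote.address=node1.example.com:61616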