Messages become "stuck" in the being-delivered state when clients use a clustered XA connection factory in a cluster of at least two nodes.
-2 nodes of JBoss EAP 4.3 CP02
-commented out "ClusterPullConnectionFactory" in messaging-service.xml to prevent message redistribution and eliminate the "message suckers" as the potential culprit
-MySQL backend using the default mysql-persistence-service.xml (from <JBOSS_HOME>/docs/examples/jms)
-both nodes have a client which is a separate process (i.e. not inside JBoss)
-clients are Spring based
-one client produces and consumes, the other client just consumes
-both clients use the ClusteredXAConnectionFactory from the default connection-factories-service.xml
-both clients publish to and consume from "queue/testDistributedQueue"
-clients are configured to send persistent messages, use AUTO_ACKNOWLEDGE, and transacted sessions
Symptoms of the issue:
-when running the clients I watch the JMX-Console for the "queue/testDistributedQueue"
-as the consumers pull messages off the queue I can see the MessageCount and DeliveringCount go to 0 every so often
-after a period of time (usually a few hours) the MessageCount and DeliveringCount never go back to 0
-I "kill" the clients and wait for the DeliveringCount to go to 0, but it never does
-after the clients are killed the ConsumerCount for the queue will drop, but never to 0 when messages are "stuck"
-a thread dump reveals at least one JBM server session that is apparently stuck (it never goes away) - ostensibly this is the consumer that is showing in the JMX-Console for "queue/testDistributedQueue"
-a "killall -3 java" doesn't produce anything from the clients so I know they're dead
-nothing is in any DLQ or expiry queue
-the database contains as many rows in the JBM_MSG and JBM_MSG_REF tables as the DeliveringCount in the JMX-Console
-rebooting the node with the stuck messages frees the messages to be consumed (i.e. un-sticks them)
-nothing else is happening on either node but running the client and running JBoss
-this only appears to happen when a clustered connection factory is used. I tested with a normal (non-clustered) connection factory and couldn't reproduce a stuck message after 24 hours.
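
For anyone trying to reproduce this without watching the JMX-Console for hours, the "DeliveringCount never goes back to 0" symptom can be checked mechanically. Below is a minimal sketch in plain Java, assuming the counter is sampled periodically by some other means. The class StuckMessageCheck, its Sample type and the looksStuck method are hypothetical names made up for illustration; they are not part of JBoss Messaging, and the threshold is an arbitrary choice.

```java
import java.util.List;

// Hypothetical helper (not a JBoss Messaging API): decides whether a series of
// DeliveringCount readings matches the "stuck" pattern described above.
final class StuckMessageCheck {

    /** One reading of the queue's DeliveringCount at a point in time. */
    static final class Sample {
        final long timestampMillis;
        final int deliveringCount;

        Sample(long timestampMillis, int deliveringCount) {
            this.timestampMillis = timestampMillis;
            this.deliveringCount = deliveringCount;
        }
    }

    /**
     * Returns true when DeliveringCount has stayed above zero for at least
     * thresholdMillis, measured back from the latest sample.
     * Assumes the samples are ordered by ascending timestamp.
     */
    static boolean looksStuck(List<Sample> samples, long thresholdMillis) {
        if (samples.isEmpty()) {
            return false;
        }
        boolean inNonZeroRun = false;
        long runStart = 0L; // timestamp at which the current non-zero run began
        for (Sample s : samples) {
            if (s.deliveringCount == 0) {
                inNonZeroRun = false;       // counter drained, so reset the run
            } else if (!inNonZeroRun) {
                inNonZeroRun = true;        // a new non-zero run starts here
                runStart = s.timestampMillis;
            }
        }
        long latest = samples.get(samples.size() - 1).timestampMillis;
        return inNonZeroRun && latest - runStart >= thresholdMillis;
    }
}
```

With a threshold well above the normal interval between drops to 0 (for example an hour, given that the counts normally hit 0 "every so often"), a true result would correspond to the state I see after a few hours of running. It only flags the pattern; it says nothing about the cause.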