We use a range of HornetQ versions, including 2.2.5 and 2.2.11 (as shipped with the JBoss application server), and in our robustness tests we regularly see a serious issue that causes our JMS message producers to hang indefinitely in the send() call.
This seems to happen some time after a different producer that was sending messages got uncleanly disconnected (by killing the sender process or simulating a network connection failure). We see this with durable topics as well as queues (both using persistent messages; we use one JMS session object per consumer). The stack trace always shows that we are stuck in the acquireCredits() call.
We're using the default unclustered configuration files that come with the HornetQ distribution, i.e. address-full-policy=BLOCK (with a max size of 10MB). After seeing the problem, I started running with client-failure-check-period=connection-ttl=8000ms to ensure dead connections get cleaned up as quickly as possible, which didn't make much difference.
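For reference, the relevant settings look like this (element names as in the stock 2.2.x configuration files; the 8000ms values are ours, the rest are the shipped defaults):

```xml
<!-- hornetq-configuration.xml: default address settings -->
<address-settings>
   <address-setting match="#">
      <max-size-bytes>10485760</max-size-bytes>
      <address-full-policy>BLOCK</address-full-policy>
   </address-setting>
</address-settings>

<!-- hornetq-jms.xml: faster dead-connection cleanup (values in ms) -->
<connection-factory name="NettyConnectionFactory">
   <client-failure-check-period>8000</client-failure-check-period>
   <connection-ttl>8000</connection-ttl>
   ...
</connection-factory>
```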
Note that we have message consumers listening to the topic and queue at all times (never disconnected or killed), so there is really no reason for flow control to block the senders for long. The behaviour I'd expect is: after the connection-ttl timeout expires, any producer flow-control credits allocated to dead producers are returned to the available pool; any producers whose connection went down throw an exception and get cleaned up; and any new producers replacing them can acquire credits and keep sending. Instead, we end up blocking in acquireCredits forever.
Unfortunately ClientProducerCreditsImpl/ClientProducerCreditManagerImpl have no logging, so I downloaded the source and added log messages to work out what is happening. There appears to be a race in which the credit manager can call close() (which closes its child producers' credits), but the manager may immediately afterwards create a new CreditsImpl whose associated session is already closed and therefore unusable; specifically, session.isClosed() returns true in the ClientProducerCreditsImpl constructor. The code should not permit this, because such a CreditsImpl is guaranteed never to be notified about the closed session, so acquireCredits blocks forever and the customer/client application hangs.
To check my hypothesis I wrote some hacky code with a watchdog thread that waits until the producer thread has been blocked in ClientProducerCreditsImpl for a long time, and then interrupts it. As soon as the interruption happens, the code path goes back into the session and promptly throws an exception, which allows our client/customer code to recover. This acts as a workaround, but it's pretty horrible, so we'd like to see this fixed ASAP if possible.
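The watchdog workaround looks roughly like this (a simplified sketch; the timeouts and class/method names here are our own choices, and the only HornetQ-specific detail is the stack frame we look for):

```java
// Watchdog that interrupts a producer thread stuck in acquireCredits().
class ProducerWatchdog extends Thread {
    private static final long CHECK_INTERVAL_MS = 5000;
    private static final long PRODUCER_TIMEOUT_MS = 60000;

    private final Thread producerThread;
    private long blockedSince = -1;

    ProducerWatchdog(Thread producerThread) {
        this.producerThread = producerThread;
        setDaemon(true);
    }

    // Scan the producer thread's stack for the HornetQ frame we hang in.
    boolean isStuckInAcquireCredits() {
        for (StackTraceElement frame : producerThread.getStackTrace()) {
            if (frame.getClassName().endsWith("ClientProducerCreditsImpl")
                    && frame.getMethodName().equals("acquireCredits")) {
                return true;
            }
        }
        return false;
    }

    @Override
    public void run() {
        while (producerThread.isAlive()) {
            if (isStuckInAcquireCredits()) {
                if (blockedSince < 0) {
                    blockedSince = System.currentTimeMillis();
                } else if (System.currentTimeMillis() - blockedSince > PRODUCER_TIMEOUT_MS) {
                    // Interrupting makes the code path go back into the
                    // session, which throws, letting the caller recover.
                    producerThread.interrupt();
                    blockedSince = -1;
                }
            } else {
                blockedSince = -1;
            }
            try {
                Thread.sleep(CHECK_INTERVAL_MS);
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}
```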
To fix this, I would suggest changing the credit manager or session so that it never instantiates a ClientProducerCreditsImpl after close() has been called (in a thread-safe way); it would also be good to add an assertion to the constructor to confirm this.
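A minimal sketch of the proposed fix, using simplified stand-in types rather than HornetQ's actual classes: creation and close() share one lock, so no credits object can be created after close(), and the constructor fails fast on a closed session.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the client session: only the closed flag matters here.
class Session {
    private volatile boolean closed;
    boolean isClosed() { return closed; }
    void close() { closed = true; }
}

class ProducerCredits {
    ProducerCredits(Session session) {
        // Assertion proposed in the report: a credits object tied to a
        // closed session will never be released, so fail fast instead.
        if (session.isClosed()) {
            throw new IllegalStateException("session already closed");
        }
    }
}

class CreditManager {
    private final Session session;
    private final List<ProducerCredits> credits = new ArrayList<ProducerCredits>();
    private boolean closed;

    CreditManager(Session session) { this.session = session; }

    // Same lock as close(): creation can never race past a close().
    synchronized ProducerCredits getCredits() {
        if (closed) {
            throw new IllegalStateException("credit manager closed");
        }
        ProducerCredits c = new ProducerCredits(session);
        credits.add(c);
        return c;
    }

    synchronized void close() {
        closed = true;
        credits.clear();
    }
}
```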
[Also (less important): you might consider using semaphore.tryAcquire(credits, <small finite time>, TimeUnit.SECONDS) instead of .acquire(), so that if something does go unexpectedly wrong we have a chance to log a warning message, re-request credits, check whether the session was closed, etc.]
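The suggested tryAcquire loop could be sketched as follows (our own class and method names, not HornetQ's; ClientProducerCreditsImpl does use a java.util.concurrent.Semaphore internally, but the surrounding structure here is illustrative):

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

class CreditAcquirer {
    private final Semaphore semaphore;

    CreditAcquirer(int initialCredits) {
        this.semaphore = new Semaphore(initialCredits);
    }

    // Hypothetical stand-in for the session's closed check.
    interface SessionCheck { boolean isClosed(); }

    void acquireCredits(int credits, SessionCheck session) throws InterruptedException {
        // Poll with a finite timeout instead of blocking forever, so each
        // time round the loop we can detect that something went wrong.
        while (!semaphore.tryAcquire(credits, 10, TimeUnit.SECONDS)) {
            if (session.isClosed()) {
                throw new IllegalStateException("session closed while waiting for credits");
            }
            // In the real code we would also re-request credits from the
            // server here before retrying.
            System.err.println("WARN: still waiting for " + credits + " producer credits");
        }
    }

    int available() { return semaphore.availablePermits(); }
}
```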