Bug
Resolution: Done
Major
JBoss A-MQ 6.1
None
A broker is configured to forward messages to two other brokers through a network connector defined with a masterslave: URL. Messages placed on the upstream broker are forwarded to one of the two downstream brokers, chosen according to the order given in the URL. When one of the downstream brokers fails or is shut down (so that its JVM is no longer running), the upstream broker detects the failure immediately and switches to the other broker.
However, when network connectivity between the upstream broker and the active downstream broker is lost, the upstream broker does not always respond correctly. It does detect the failure, because the following message appears in the log when the log level is set to DEBUG:
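For reference, a minimal sketch of the kind of network connector described above, as it might appear in the upstream broker's XML configuration. The connector name and the host names downstream1 and downstream2 are placeholders, not values from the affected installation; the port is taken from the log line in this report and is assumed to be the same on both downstream brokers.

```xml
<!-- Sketch only: connector name and host names are hypothetical placeholders. -->
<networkConnectors>
  <!-- masterslave: tries the listed brokers in the order given,
       so downstream1 is preferred and downstream2 is the fallback. -->
  <networkConnector name="to-downstream"
      uri="masterslave:(tcp://downstream1:61617,tcp://downstream2:61617)"/>
</networkConnectors>
```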
17:10:01,126 | DEBUG | r ReadCheckTimer | AbstractInactivityMonitor | 131 - org.apache.activemq.activemq-osgi - 5.9.0.redhat-610394 | No message received since last read check for tcp:///10.5.1.17:61617@15034. Throwing InactivityIOException.
However, the upstream broker does not always switch over.
The problem is not fully reproducible. With the default prefetch size on the network connector (1000 messages), it appears to be reproducible as long as there is a continuous flow of messages through the installation. If the message flow is slower or bursty, or the prefetch size is set to a smaller number, it is less reproducible. With a prefetch size of 1, it does not appear to be reproducible at all in my tests.
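The prefetch size used in these tests is set with the prefetchSize attribute of the network connector. The fragment below is an illustrative sketch of the prefetch-of-1 case; the connector name and host names are again placeholders rather than the real configuration.

```xml
<!-- Sketch only: connector name and host names are hypothetical placeholders. -->
<networkConnectors>
  <!-- prefetchSize defaults to 1000 on a network connector;
       with a value of 1 the problem was not reproduced in the tests above. -->
  <networkConnector name="to-downstream"
      uri="masterslave:(tcp://downstream1:61617,tcp://downstream2:61617)"
      prefetchSize="1"/>
</networkConnectors>
```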
However, the problem is not simply that one downstream broker has prefetched its quota and then gone away, leaving no messages for the other. I can put tens of thousands of messages on the upstream broker, and see no attempt to forward anything to the downstream broker that is still running.
Most bizarrely of all, sometimes I can stop a downstream broker, wait a minute, and then restart it, and only then see the upstream broker fail over to the other downstream broker. It's as if the upstream broker knows it has to fail over (because we see the message in the log), but some network operation against the disconnected downstream broker is blocking it for some reason.