There is intermittent issue in rollback of JMS session after failover/failback which can leave JMS session stopped.
Customer impact: Consumer will have stopped session after failover/failback which prevents receiving of new messages. Customer application will starve for messages even though there are messages in queue.
Tests scenario:
Start two EAP 7.1.0.CR1 servers in dedicated HA topology with queue testQueue (live and backup)
Start 3 producers and 1 consumer on live server sending/receiving messages from testQueue in transacted session.
50 times kill and start live server which causes clients to 50 times failover and failback
Stop producers and wait for consumer to receive all messages
Check that number of send and received is the same. No losses or duplicates
Result:
In failed test, after 13 failovers, there was issue during failback where consumer ends with stopped session and thus does not consume any new messages.
Investigation:
There is problem that consumer which received large message from backup sends ack to live server which results in javax.jms.IllegalStateException:
09:46:16,123 Thread-1956 ERROR [org.jboss.qa.hornetq.apps.clients.Receiver11:161] RETRY receive for host: 127.0.0.1, Trying to receive message with count: 23804 javax.jms.IllegalStateException: AMQ119027: Could not find reference on consumer ID=0, messageId = 49,392,176,683 queue = jms.queue.testQueue0 at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:404) at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:315) at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sendACK(ActiveMQSessionContext.java:467) at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.acknowledge(ClientSessionImpl.java:746) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.doAck(ClientConsumerImpl.java:1105) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.acknowledge(ClientConsumerImpl.java:764) at org.apache.activemq.artemis.core.client.impl.ClientMessageImpl.acknowledge(ClientMessageImpl.java:96) at org.apache.activemq.artemis.core.client.impl.ClientMessageImpl.acknowledge(ClientMessageImpl.java:38) at org.apache.activemq.artemis.jms.client.ActiveMQMessageConsumer.getMessage(ActiveMQMessageConsumer.java:241) at org.apache.activemq.artemis.jms.client.ActiveMQMessageConsumer.receive(ActiveMQMessageConsumer.java:132) at org.jboss.qa.hornetq.apps.clients.Receiver11.receiveMessage(Receiver11.java:153) at org.jboss.qa.hornetq.apps.clients.ReceiverTransAck.run(ReceiverTransAck.java:83) Caused by: ActiveMQIllegalStateException[errorType=ILLEGAL_STATE message=AMQ119027: Could not find reference on consumer ID=0, messageId = 49,392,176,683 queue = jms.queue.testQueue0] ... 12 more
This is because live server has no knowledge of messages being send to client by backup server before failback. This behaviour is still ok as client logic will simply retry receive of message.
Exception above causes that JMS transaction in which client consumed messages is marked as "rollback only" and when client calls session.commit() then internal Artemis client logic is actually doing rollback. This however results in another javax.jms.IllegalStateException instead of javax.jms.TransactionRollbackException and leaves session stopped:
09:46:16,139 Thread-1956 ERROR [org.jboss.qa.hornetq.apps.clients.Receiver11:64] Receiver - COMMIT FAILED - JMSException thrown during commit: AMQ119027: Could not find reference on consumer ID=0, messageId = 49,392,176,659 queue = jms. queue.testQueue0. Receiver for node: 127.0.0.1. Received message - count: 23805, retrying receive javax.jms.IllegalStateException: AMQ119027: Could not find reference on consumer ID=0, messageId = 49,392,176,659 queue = jms.queue.testQueue0 at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:404) at org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:315) at org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sendACK(ActiveMQSessionContext.java:467) at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.acknowledge(ClientSessionImpl.java:746) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.doAck(ClientConsumerImpl.java:1105) at org.apache.activemq.artemis.core.client.impl.ClientConsumerImpl.flushAcks(ClientConsumerImpl.java:789) at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.flushAcks(ClientSessionImpl.java:1646) at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.rollback(ClientSessionImpl.java:564) at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.rollback(ClientSessionImpl.java:538) at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.rollbackOnFailover(ClientSessionImpl.java:472) at org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.commit(ClientSessionImpl.java:493) at org.apache.activemq.artemis.jms.client.ActiveMQSession.commit(ActiveMQSession.java:218) at org.jboss.qa.hornetq.apps.clients.Receiver11.commitSession(Receiver11.java:42) at org.jboss.qa.hornetq.apps.clients.ReceiverTransAck.run(ReceiverTransAck.java:103) Caused by: ActiveMQIllegalStateException[errorType=ILLEGAL_STATE message=AMQ119027: Could not find reference on consumer ID=0, messageId = 49,392,176,659 queue = jms.queue.testQueue0] ... 14 more
The problem is that during rollback, client stops session, sends ack for last received message, sends rollback packet and then starts session again and if exception is thrown for sending ack then it never starts session again - ClientSessionImpl.rollback:
public void rollback(final boolean isLastMessageAsDelivered, final boolean waitConsumers) { ... boolean wasStarted = started; if (wasStarted) { stop(); } // We need to make sure we don't get any inflight messages for (ClientConsumerInternal consumer : cloneConsumers()) { consumer.clear(waitConsumers); } // Acks must be flushed here *after connection is stopped and all onmessages finished executing flushAcks(); sessionContext.simpleRollback(isLastMessageAsDelivered); if (wasStarted) { start(); } ... }
This is not a regression. This code is pretty old, the same is in EAP 6/HornetQ and it seems that work on network layer (in Netty processor) to allow more parallel execution increased chances to hit it.
- clones
-
JBEAP-13000 Rollback of JMS session after failover/failback can leave JMS session stopped
- Reopened
- is cloned by
-
ENTMQBR-8480 Rollback of JMS session after failover/failback can leave JMS session stopped
- Backlog