Status: Verified
Affects Version/s: 7.0.0.ER6
Steps to Reproduce:
git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
cd eap-tests-hornetq/scripts/
git checkout refactoring_modules
groovy -DEAP_VERSION=7.0.0.ER6 PrepareServers7.groovy
export WORKSPACE=$PWD
export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
export JOURNAL_DIRECTORY_A=$WORKSPACE/journal-A
export JOURNAL_DIRECTORY_B=$WORKSPACE/journal-B
export JOURNAL_DIRECTORY_C=$WORKSPACE/journal-C
export JOURNAL_DIRECTORY_D=$WORKSPACE/journal-D
cd ../jboss-hornetq-testsuite/
mvn clean test -Dtest=ReplicatedColocatedClusterFailoverTestCase#testFailbackWithMdbsShutdown -DfailIfNoTests=false -Deap=7x | tee log
Release Notes Docs Status: Documented as Known Issue
Scenario: We have two nodes in a (manually created) colocated replicated topology. Both nodes contain InQueue and OutQueue.
- We send 2000 messages (a mix of large and normal messages) to InQueue on node 1
- On each node we deploy an MDB that resends messages from InQueue to OutQueue
- While the messages are being resent, we cleanly shut down node 2 and, after some time, start it again
- We receive messages from OutQueue on node 1 and check whether the number of received messages equals the number of sent messages
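The last step of the scenario compares counts only. A sketch of a stricter check follows, assuming that each sent message carries a unique ID that can be read back on the receiving side (for example a property set by the producer). The class and method names are hypothetical and are not taken from the testsuite.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/**
 * Hypothetical helper (not part of eap-tests-hornetq): compares the IDs of messages
 * sent to InQueue with the IDs received from OutQueue and reports the missing ones.
 * Assumes every sent message carries a unique, readable ID.
 */
final class LostMessageCheck {

    private LostMessageCheck() {
    }

    /** Returns the sent IDs that were never received, in the order they were sent. */
    static List<String> missingIds(List<String> sentIds, List<String> receivedIds) {
        // A set ignores duplicates, so a message delivered twice cannot hide a lost one.
        Set<String> received = new HashSet<>(receivedIds);
        List<String> missing = new ArrayList<>();
        for (String id : sentIds) {
            if (!received.contains(id)) {
                missing.add(id);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> sent = List.of("m1", "m2", "m3", "m4");
        List<String> received = List.of("m1", "m3", "m4");
        // prints "received 3 of 4, missing: [m2]"
        System.out.println("received " + received.size() + " of " + sent.size()
                + ", missing: " + missingIds(sent, received));
    }
}
```

Comparing IDs rather than counts keeps a duplicate delivery from masking a loss, and the reported IDs identify which messages to look for in the sf.my-cluster queue.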
Expectation: all messages will be resent
Actual state: some messages are never resent and are lost
Customer impact: large messages might be lost in a colocated HA topology with a replicated journal if one of the servers is cleanly shut down
As you can see in  and , the lost messages are stuck in the sf.my-cluster queue of node 2, and the corresponding large message files have zero length. The bodies of the lost messages are in largemessages1, see .
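To check a server for the zero-length symptom, a small scan of a large-message directory may help. This is a hypothetical helper, not part of the testsuite. The directory is passed on the command line because its location depends on the server's journal configuration, and the scan reports every zero-length regular file without assuming any particular file-name pattern.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic: lists zero-length regular files in one directory (non-recursive). */
final class ZeroLengthLargeMessages {

    private ZeroLengthLargeMessages() {
    }

    static List<Path> findZeroLength(Path dir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(dir)) {
            for (Path p : entries) {
                // Only plain files count; subdirectories are skipped.
                if (Files.isRegularFile(p) && Files.size(p) == 0L) {
                    result.add(p);
                }
            }
        }
        result.sort(null); // natural Path ordering gives stable output
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ZeroLengthLargeMessages <large-message-dir>");
            return;
        }
        for (Path p : findZeroLength(Paths.get(args[0]))) {
            System.out.println(p);
        }
    }
}
```

Run against the large-message directory of the affected server after failback, each printed file is a candidate for a message whose body was lost.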
Race condition that causes the loss of messages
- Node 2 decides to redistribute message-1 to node 1
- It creates a copy of message-1 with a new messageID (message-2), and message-1 is considered delivered
- Meanwhile, node 2 is shutting down, so the redistribution of message-2 to node 1 fails
- After that, the backup on node 1 becomes live and continues redistributing message-2 to the live server on node 1
- The backup knows about message-2 but does not have its body, so it sends only the header packet and waits for an acknowledgement from the live server. The live server receives the header packet and waits for the chunk packets. Both servers wait for each other.
- Node 2 is started again. The live server on node 2 synchronizes with the backup on node 1 and thus receives message-2 with a zero-length body.
- Again, node 2 sends only the header packet and waits for an acknowledgement, while node 1 receives the header packet and waits for chunks.
- Message-2 is stuck in the sf.my-cluster queue and its body is lost.
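The mutual wait in the steps above can be illustrated with a deliberately simplified model. It assumes, as described above, that a sender holding no body sends only the header packet and then waits for an acknowledgement, and that the receiver acknowledges only after it has seen the final chunk. The class, enum, and packet names are hypothetical and do not come from the HornetQ/Artemis code base.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Simplified, hypothetical model of the header/chunk hand-off for a large message.
 * Not the HornetQ/Artemis implementation; all names are invented for illustration.
 */
final class LargeMessageHandoffModel {

    enum Packet { HEADER, CHUNK, LAST_CHUNK, ACK }

    private LargeMessageHandoffModel() {
    }

    /**
     * Simulates one transfer of a message whose locally available body consists of
     * bodyChunks chunks. Returns true if the sender gets its ACK, false if the
     * exchange stalls with both sides waiting for each other.
     */
    static boolean transfer(int bodyChunks) {
        Deque<Packet> toReceiver = new ArrayDeque<>();
        Deque<Packet> toSender = new ArrayDeque<>();

        // Sender: always sends the header, then chunks only for the body it actually has.
        toReceiver.add(Packet.HEADER);
        for (int i = 1; i <= bodyChunks; i++) {
            toReceiver.add(i == bodyChunks ? Packet.LAST_CHUNK : Packet.CHUNK);
        }
        // Conceptually, the sender now blocks until an ACK arrives.

        // Receiver: after the header it consumes chunks and acknowledges only
        // when the last chunk arrives; intermediate chunks just get stored.
        boolean headerSeen = false;
        while (!toReceiver.isEmpty()) {
            Packet p = toReceiver.poll();
            if (p == Packet.HEADER) {
                headerSeen = true;
            } else if (p == Packet.LAST_CHUNK && headerSeen) {
                toSender.add(Packet.ACK);
            }
        }
        // No further packets will be sent. Without an ACK, the sender keeps waiting
        // for the ACK and the receiver keeps waiting for chunks.
        return toSender.contains(Packet.ACK);
    }

    public static void main(String[] args) {
        System.out.println("body with 3 chunks -> acknowledged: " + transfer(3));
        System.out.println("zero-length body   -> acknowledged: " + transfer(0));
    }
}
```

In this model, any body with at least one chunk completes, while a zero-length body never produces the final chunk the receiver is waiting for. That matches both the first stall (backup without the body) and the second one (body synchronized with zero length).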