Status: Resolved
Steps to Reproduce:
git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
cd eap-tests-hornetq/scripts/
groovy -DEAP_ZIP_URL=https://eap-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/EAP7/view/EAP7-JMS/view/early-testing/view/tooling/job/early-testing-messaging-prepare/257/artifact/jboss-eap.zip PrepareServers7.groovy
export WORKSPACE=$PWD
export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
cd ../jboss-hornetq-testsuite/
mvn clean test -Dtest=NetworkFailuresHornetQCoreBridges#testNetworkFailureSmallMessages -DfailIfNoTests=false -Deap=7x -Deap7.org.jboss.qa.hornetq.apps.clients.version=7.1521531853-SNAPSHOT | tee log
or
mvn clean test -Dtest=Lodh4TestCase#testFailOfOneServer -Deap7.org.jboss.qa.hornetq.apps.clients.version=7.1521531853-SNAPSHOT -DfailIfNoTests=false -Deap=7x | tee log
or
mvn clean test -Dtest=DedicatedFailoverCoreBridges#testFailbackKillWithBridgeWithStaticNIOConnectors -Deap7.org.jboss.qa.hornetq.apps.clients.version=7.1521531853-SNAPSHOT -DfailIfNoTests=false -Deap=7x | tee log
- There are two Artemis brokers configured to form a cluster
- There is a producer sending messages to broker 1 and a receiver receiving messages from broker 2
- Between the brokers there is a proxy which simulates network failure
- The proxy is stopped and restarted several times to simulate the network failure
- The test expects that all messages sent to broker 1 will be received by the receiver from broker 2 (despite the network failures)
Reality: After the proxy is stopped and restarted, the cluster is not able to form again. Both brokers keep trying to reconnect to each other, but never succeed.
Customer scenario: Messaging cluster is not able to recover after network failures.
Investigation of issue
I investigated why the brokers are not able to reconnect and found that every time they try to reconnect, they give up because there is no topology record for the nodeId they are trying to connect to. So the reconnection attempt ends here.
I compared the behavior with Artemis 1.x and found that Artemis 2.x removes the topology member when a connection failure is detected, whereas Artemis 1.x does not. When I commented out the line responsible for that removal, the issue was fixed. This line is not present in 1.x.
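To make the difference easier to see, here is a minimal Java sketch of the behavior described above. It is not Artemis code: the class `TopologySketch`, its methods, the `removeMemberOnFailure` flag, the node id `broker-2` and the address `tcp://host2:61616` are all hypothetical names invented for illustration. It only models the two observations from the investigation: a reconnect attempt gives up when there is no topology record for the target nodeId, and the 2.x behavior removes that record when a connection failure is detected.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the behavior described in this issue.
// None of these names are real Artemis classes or methods.
final class TopologySketch {
    // nodeId -> connector address of the cluster member
    private final Map<String, String> members = new ConcurrentHashMap<>();

    // true models the described 2.x behavior, false the described 1.x behavior
    private final boolean removeMemberOnFailure;

    TopologySketch(boolean removeMemberOnFailure) {
        this.removeMemberOnFailure = removeMemberOnFailure;
    }

    // Record a cluster member in the topology.
    void memberUp(String nodeId, String address) {
        members.put(nodeId, address);
    }

    // Called when the connection to a member is detected as failed.
    void connectionFailed(String nodeId) {
        if (removeMemberOnFailure) {
            members.remove(nodeId); // the removal that is absent in 1.x
        }
    }

    // Returns false when the attempt is abandoned because the topology
    // holds no record for the target nodeId.
    boolean tryReconnect(String nodeId) {
        String address = members.get(nodeId);
        if (address == null) {
            return false; // no topology record: give up
        }
        // A real broker would open a connection to 'address' here.
        return true;
    }

    public static void main(String[] args) {
        for (boolean remove : new boolean[] {true, false}) {
            TopologySketch topology = new TopologySketch(remove);
            topology.memberUp("broker-2", "tcp://host2:61616");
            topology.connectionFailed("broker-2"); // proxy stopped
            System.out.println("removeMemberOnFailure=" + remove
                    + " -> reconnect attempted: " + topology.tryReconnect("broker-2"));
        }
    }
}
```

In this model, with removal enabled the record is gone by the time the proxy comes back, so every reconnect attempt is abandoned. With removal disabled the record survives the failure and the attempt can proceed, which matches the effect observed when the line was commented out.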