-
Bug
-
Resolution: Done
-
Blocker
-
None
-
None
-
Release Notes
-
-
Documented as Resolved Issue
I have a 3 pairs (6 VMs) of HA replication.
Such slave can't be stopped via "artemis-service stop", it has to be killed.
Also, the rest of HA loses track of this broker in topology.
Testing/reproducing scenario:
- Make sure HA is formed properly (3 HA pairs)
- Loop over master brokers and do following
- Kill -9 master broker
- Make sure slave becomes live
- Resurrect master broker
- Make sure master becomes live again, slave is backup again
Usually, this work OK for first or second execution of such test.
But there is a problem, when I execute it (f.e.) third time.
Master is killed, slave does not become live.
After master resurrection, slave becomes isolated from whole HA.
It might be connected to Quorum voting process (this seems like split-brain problem?).
Also, I am not sending ANY messages.
This time, this happened the very first try on the very first broker.
Note: After master is restarted, the topology is completely off and some of the brokers are using different topology than others.
Test log
[10:19:54] [INFO] tests.JAMQHA000Tests :: master:amq-10-37-145-204, instance:/home/jamq/ha-replication, slave:amq-10-37-145-214 [10:19:54] [INFO] tests.JAMQHA000Tests :: master:amq-10-37-145-202, instance:/home/jamq/ha-replication, slave:amq-10-37-145-205 [10:19:54] [INFO] tests.JAMQHA000Tests :: master:amq-10-37-145-203, instance:/home/jamq/ha-replication, slave:amq-10-37-145-223 [10:19:54] [INFO] tests.JAMQHA000Tests :: slave:amq-10-37-145-214, instance:/home/jamq/ha-replication, master:amq-10-37-145-204 [10:19:54] [INFO] tests.JAMQHA000Tests :: slave:amq-10-37-145-205, instance:/home/jamq/ha-replication, master:amq-10-37-145-202 [10:19:54] [INFO] tests.JAMQHA000Tests :: slave:amq-10-37-145-223, instance:/home/jamq/ha-replication, master:amq-10-37-145-203 [10:19:55] [INFO] tests.JAMQHA000Tests :: Running brokers: amq-10-37-145-223 pid:22065 amq-10-37-145-205 pid:16071 amq-10-37-145-204 pid:11722 amq-10-37-145-214 pid:12820 amq-10-37-145-203 pid:3704 amq-10-37-145-202 pid:18809 [10:19:55] [PASS] dtestlib.Test :: Master is live # result:True (exp. True), dur.:-1.00 [10:19:55] [PASS] dtestlib.Test :: Slave is backup # result:True (exp. True), dur.:-1.00 [10:19:55] [WARNING] dtestlib.sut.ExternalBrokerOperable :: kill -s 9 11722 [10:19:55] [PASS] dtestlib.Test :: [root@10.37.145.204]$ kill -s 9 11722 # ecode:0 (exp. True), dur.:0.07 [10:19:55] [INFO] tests.JAMQHA000Tests :: Slave should be master [10:19:55] [INFO] dtestlib.sut.ExternalBrokerClusterHA :: Wait for broker 'amq-10-37-145-214' to become live. [10:19:58] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt:1 [10:19:58] [INFO] dtestlib :: retry: round=1/15, elapsed=3.05/60, result=False [10:20:03] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt:1 [10:20:03] [INFO] dtestlib :: retry: round=2/15, elapsed=7.27/60, result=False [10:20:08] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt:1 [10:20:08] [INFO] dtestlib :: retry: round=3/15, elapsed=12.30/60, result=False [10:20:11] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt:1 [10:20:11] [INFO] dtestlib :: retry: round=4/15, elapsed=15.78/60, result=False [10:20:15] [FAIL] dtestlib.Test :: Connection count compare (exp vs act) # result:0 (exp. 1), dur.:-1.00 err_cnt .. FAILED
Slave log (214)
2018-04-25 10:14:06,236 INFO [io.hawt.web.JolokiaConfiguredAgentServlet] Jolokia overridden property: [key=policyLocation, value=file:/home/jamq/ha-replication/etc/jolokia-access.xml] 2018-04-25 10:14:06,283 INFO [io.hawt.web.RBACMBeanInvoker] Using MBean [hawtio:type=security,area=jmx,rank=0,name=HawtioDummyJMXSecurity] for role based access control 2018-04-25 10:14:06,540 INFO [io.hawt.system.ProxyWhitelist] Initial proxy whitelist: [localhost, 127.0.0.1, 10.37.145.214, dhcp-145-214.lab.eng.brq.redhat.com] 2018-04-25 10:14:06,997 INFO [org.apache.activemq.artemis] AMQ241001: HTTP Server started at http://0.0.0.0:8161 2018-04-25 10:14:06,997 INFO [org.apache.activemq.artemis] AMQ241002: Artemis Jolokia REST API available at http://0.0.0.0:8161/console/jolokia 2018-04-25 10:14:06,997 INFO [org.apache.activemq.artemis] AMQ241004: Artemis Console available at http://0.0.0.0:8161/console 2018-04-25 10:14:10,598 INFO [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::serverUUID=974aeba4-4860-11e8-986b-5254007f6e6f is synchronized with live-server. 2018-04-25 10:14:14,544 INFO [org.apache.activemq.artemis.core.server] AMQ221031: backup announced 2018-04-25 10:15:09,741 INFO [io.hawt.web.LoginServlet] hawtio login is using 1800 sec. HttpSession timeout >> killed master broker (204) [10:19:55] [PASS] dtestlib.Test :: [root@10.37.145.204]$ kill -s 9 11722 # ecode:0 (exp. True), dur.:0.07 2018-04-25 10:19:55,869 INFO [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote 2018-04-25 10:19:55,872 INFO [org.apache.activemq.artemis.core.server] AMQ221067: Waiting 30 seconds for quorum vote results. 2018-04-25 10:19:55,904 INFO [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to 10.37.145.202/10.37.145.202:61616: ServerConnectVote [nodeId=974aeba4-4860-11e8-986b-5254007f6e6f, vote=false] 2018-04-25 10:19:55,924 INFO [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from 10.37.145.202/10.37.145.202:61616: ServerConnectVote [nodeId=974aeba4-4860-11e8-986b-5254007f6e6f, vote=true] 2018-04-25 10:20:25,874 INFO [org.apache.activemq.artemis.core.server] AMQ221069: Timeout waiting for quorum vote responses. 2018-04-25 10:20:35,875 WARN [org.apache.activemq.artemis.core.client] AMQ212041: Timed out waiting for netty channel to close 2018-04-25 10:21:14,566 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: AMQ119011: Did not receive data from server for org.apache.activemq.artemis.core.remoting.impl.netty.NettyConnection@6eebd7a8[ID=08040475, local= /10.37.145.214:59586, remote=10.37.145.202/10.37.145.202:61616] [code=CONNECTION_TIMEDOUT]
Reproducer:
Use ansible to deploy HA replication on 6 VMs.
git clone https://github.com/msgqe/ansible-broker-clusters ansible-playbook roles/ha-replication/provision.yml -i roles/ha-replication/inventory/deployment.yml -u root (you will need compiled JMS client - test is using connector client to figure out, if broker is up or not) Execute tests from dtests --dschema DSNJAMQExternalClusterHA --test-node 10.37.145.202 --test-node 10.37.145.203 --test-node 10.37.145.204 --test-node 10.37.145.205 --test-node 10.37.145.214 --test-node 10.37.145.223 --set-option=/test-node-dtest-home-refresh-ena=False --load-json-settings /home/mtoth/work/repos/dtests/dtests/dtestlib/sut/example_6brokerHA_config.json --exec JAMQHA111Tests/test_kill_start_master_failback
- relates to
-
WFWIP-11 [Artemis upgrade] Regression in replicated HA tests
- Resolved
- is blocked by
-
ARTEMIS-1842 Loading...