-
Bug
-
Resolution: Done
-
Blocker
-
A-MQ 7.0.0.CR1
-
None
-
-
- create a 2 master, 2 slave broker cluster
- kill a master using 'kill -9 <pid>'
With brokers 1 and 2 as masters and brokers 3 and 4 as slaves, if broker 1 is killed, neither 3 nor 4 takes over. The following is logged on brokers 2 and 3.
broker 2:
21:05:00,807 WARN [org.apache.activemq.artemis.core.server] AMQ222095: Connection failed with failedOver=false
21:05:00,888 INFO [org.apache.activemq.artemis.core.server] AMQ221062: Received quorum vote request: ServerConnectVote [nodeId=db48f7df-ca56-11e7-a4f0-185e0fcdacdf, vote=false]
21:05:00,888 INFO [org.apache.activemq.artemis.core.server] AMQ221064: Node db48f7df-ca56-11e7-a4f0-185e0fcdacdf found in cluster topology
21:05:00,889 INFO [org.apache.activemq.artemis.core.server] AMQ221063: Sending quorum vote response: ServerConnectVote [nodeId=db48f7df-ca56-11e7-a4f0-185e0fcdacdf, vote=false]
broker 3:
21:05:00,860 INFO [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote
21:05:00,864 INFO [org.apache.activemq.artemis.core.server] AMQ221067: Waiting 30 seconds for quorum vote results.
21:05:00,872 INFO [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to localhost/127.0.0.1:61617: ServerConnectVote [nodeId=db48f7df-ca56-11e7-a4f0-185e0fcdacdf, vote=false]
21:05:00,891 INFO [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from localhost/127.0.0.1:61617: ServerConnectVote [nodeId=db48f7df-ca56-11e7-a4f0-185e0fcdacdf, vote=false]
21:05:00,896 INFO [org.apache.activemq.artemis.core.server] AMQ221068: Received all quorum votes.
21:05:00,896 INFO [org.apache.activemq.artemis.core.server] AMQ221070: Restarting as backup based on quorum vote results.
21:05:00,922 WARN [org.apache.activemq.artemis.core.client] AMQ212004: Failed to connect to server.
21:05:00,935 WARN [org.apache.activemq.artemis.core.client] AMQ212004: Failed to connect to server.
From acxjbertr
The issue as I see it now is that the 2nd live node always has the 1st (now dead) live node's information in its topology, so it responds to the quorum vote (incorrectly) as if the 1st live node is still active. I'm continuing to investigate.
I set the <reconnect-attempts> on the cluster-connection to 0 and it still made no difference.
From what I've been able to investigate today, when a broker is dropped with 'kill -9' the topology on the other live node doesn't change at all, so when the backup initiates a quorum vote it is told not to activate. I expected the topology to change when the node was dropped. Am I off-base in my expectation?
BTW, I also tried using the "socat" utility to drop the network connection between the brokers instead of using 'kill -9' but the behavior was the same (i.e. the topology didn't change).
I do think that's a piece of the puzzle that's missing. And the fact that it's missing makes sense.
However, another piece that's missing is that the topology on the live node never changes, so it will always tell the backup not to activate no matter how many times the backup sends a vote. At first I thought the issue might be a race condition: everything was running locally, so the backup could have initiated the quorum vote before the other live server's topology was updated. I hard-coded increasingly long delays before the vote was sent, but no matter how long I delayed the vote it always failed, because the live server always had the dead broker in its topology (even with reconnect-attempts=0 in the cluster-connection). This is the bit that doesn't make sense to me. How can this not be working?
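To make the behavior described above concrete, here is a minimal toy model in Java. It is not the Artemis implementation: the class and method names (`StaleTopologyVoteDemo`, `LiveVoter`, `requestVotes`) are hypothetical, and the voting rule (answer "do not activate" while the node is still in the topology) is an assumption taken from the description in this comment. It only shows why retrying or delaying the vote cannot help while the dead node stays in the live node's topology.

```java
import java.util.HashSet;
import java.util.Set;

public class StaleTopologyVoteDemo {

    /** Hypothetical stand-in for a live broker that answers quorum vote requests. */
    static class LiveVoter {
        private final Set<String> topology = new HashSet<>();

        void nodeUp(String nodeId) {
            topology.add(nodeId);
        }

        void nodeDown(String nodeId) {
            topology.remove(nodeId);
        }

        /** Assumed rule: vote true (backup may activate) only if the node is gone from the topology. */
        boolean vote(String nodeId) {
            return !topology.contains(nodeId);
        }
    }

    /** The backup asks up to {@code attempts} times; true if any response allows activation. */
    static boolean requestVotes(LiveVoter voter, String nodeId, int attempts) {
        for (int i = 0; i < attempts; i++) {
            if (voter.vote(nodeId)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String deadNode = "db48f7df-ca56-11e7-a4f0-185e0fcdacdf"; // node ID from the logs above
        LiveVoter broker2 = new LiveVoter();
        broker2.nodeUp(deadNode);

        // Broker 1 is killed, but broker 2's topology is never updated:
        // every vote fails, regardless of how many times (or how late) it is sent.
        System.out.println("stale topology, 5 attempts: " + requestVotes(broker2, deadNode, 5));

        // Once the dead node is removed from the topology, a single vote succeeds.
        broker2.nodeDown(deadNode);
        System.out.println("after removal, 1 attempt: " + requestVotes(broker2, deadNode, 1));
    }
}
```

In this model the outcome depends only on whether the dead node is still in the voter's topology, not on timing, which matches the observation that no hard-coded delay changed the result.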
- is related to
-
ENTMQBR-1019 Resurrected (becames live) master broker is not sharing data properly, identifies as "undefined" and prohibits it's slave to join cluster (HA topology)
- Closed
-
ENTMQBR-1021 [HA, MS1S2] When backup slave1 is killed, slave2 can't take role of backup, leaving HA on master only
- Closed
-
ENTMQBR-1018 When live-slave fails-back to master, it turns off everything down, even its console
- Closed
- is caused by
-
ARTEMIS-1565