  1. AMQ Broker
  2. ENTMQBR-932

Failover to Slave Does Not Occur When Killing Master with kill -9 in a Cluster of More Than 2 Brokers


    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • AMQ 7.1.0.CR2
    • A-MQ 7.0.0.CR1
    • high-availability
    • None
      • create a 2 master, 2 slave broker cluster
      • kill a master using 'kill -9 <pid>'

      With brokers 1 and 2 being masters and brokers 3 and 4 being slaves, if broker 1 is killed, neither 3 nor 4 takes over. The following is logged on brokers 2 and 3.

      broker 2:
      21:05:00,807 WARN [org.apache.activemq.artemis.core.server] AMQ222095: Connection failed with failedOver=false
      21:05:00,888 INFO [org.apache.activemq.artemis.core.server] AMQ221062: Received quorum vote request: ServerConnectVote [nodeId=db48f7df-ca56-11e7-a4f0-185e0fcdacdf, vote=false]
      21:05:00,888 INFO [org.apache.activemq.artemis.core.server] AMQ221064: Node db48f7df-ca56-11e7-a4f0-185e0fcdacdf found in cluster topology
      21:05:00,889 INFO [org.apache.activemq.artemis.core.server] AMQ221063: Sending quorum vote response: ServerConnectVote [nodeId=db48f7df-ca56-11e7-a4f0-185e0fcdacdf, vote=false]

      broker 3:
      21:05:00,860 INFO [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote
      21:05:00,864 INFO [org.apache.activemq.artemis.core.server] AMQ221067: Waiting 30 seconds for quorum vote results.
      21:05:00,872 INFO [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to localhost/127.0.0.1:61617: ServerConnectVote [nodeId=db48f7df-ca56-11e7-a4f0-185e0fcdacdf, vote=false]
      21:05:00,891 INFO [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from localhost/127.0.0.1:61617: ServerConnectVote [nodeId=db48f7df-ca56-11e7-a4f0-185e0fcdacdf, vote=false]
      21:05:00,896 INFO [org.apache.activemq.artemis.core.server] AMQ221068: Received all quorum votes.
      21:05:00,896 INFO [org.apache.activemq.artemis.core.server] AMQ221070: Restarting as backup based on quorum vote results.
      21:05:00,922 WARN [org.apache.activemq.artemis.core.client] AMQ212004: Failed to connect to server.
      21:05:00,935 WARN [org.apache.activemq.artemis.core.client] AMQ212004: Failed to connect to server.
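
      To compare the vote traffic across brokers, the log lines above can be reduced to their timestamp, AMQ code, and vote value. The following is a minimal sketch in Python, assuming the log format is exactly as shown in this report. The names parse_line, LOG_LINE, and VOTE are hypothetical helpers for illustration and are not part of the broker.

```python
import re

# Assumed layout, taken from the excerpts above:
# "<HH:MM:SS,mmm> <LEVEL> [<logger>] <AMQ code>: <message>"
LOG_LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+(?P<level>[A-Z]+)\s+"
    r"\[(?P<logger>[^\]]+)\]\s+(?P<code>AMQ\d{6}):\s+(?P<message>.*)$"
)
# Vote payload as it appears in the quorum vote messages above.
VOTE = re.compile(
    r"ServerConnectVote \[nodeId=(?P<node_id>[0-9a-f-]+), vote=(?P<vote>true|false)\]"
)


def parse_line(line):
    """Return a dict for one broker log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    vote = VOTE.search(entry["message"])
    # node_id and vote stay None for lines that carry no vote payload.
    entry["node_id"] = vote.group("node_id") if vote else None
    entry["vote"] = (vote.group("vote") == "true") if vote else None
    return entry


# Three lines copied from the broker 3 excerpt above.
SAMPLE = """\
21:05:00,860 INFO [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote
21:05:00,891 INFO [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from localhost/127.0.0.1:61617: ServerConnectVote [nodeId=db48f7df-ca56-11e7-a4f0-185e0fcdacdf, vote=false]
21:05:00,896 INFO [org.apache.activemq.artemis.core.server] AMQ221070: Restarting as backup based on quorum vote results.
"""

if __name__ == "__main__":
    for raw in SAMPLE.splitlines():
        entry = parse_line(raw)
        if entry is not None:
            print(entry["time"], entry["code"], entry["vote"])
```

      Filtering on entries whose vote is not None isolates the request/response pairs, which makes it easier to see that every response for the dead node's ID carries vote=false.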

      From acxjbertr

      The issue as I see it now is that the 2nd live node always has the 1st (now dead) live node's information in its topology, so it responds to the quorum vote (incorrectly) as if the 1st live node is still active. I'm continuing to investigate.
      I set the <reconnect-attempts> on the cluster-connection to 0 and it still made no difference.
      So, from what I've been able to investigate today, when a broker is dropped with 'kill -9' the topology on the other live node doesn't change at all, so when the backup initiates a quorum vote it is told not to activate. I expected the topology to change when the node was dropped. Am I off-base in my expectation?

      BTW, I also tried using the "socat" utility to drop the network connection between the brokers instead of using 'kill -9' but the behavior was the same (i.e. the topology didn't change).
      I do think that's a piece of the puzzle that's missing. And the fact that it's missing makes sense.

      However, another missing piece is that the topology on the live node never changes, so it will always tell the backup not to activate no matter how many times the backup sends a vote. At first I thought the issue might be a race condition: because everything was running locally, the backup might have initiated the quorum vote before the other live server's topology was updated. So I hard-coded increasingly long delays before the vote was sent. No matter how long I delayed the vote, it always failed because the live server still had the dead broker in its topology (even with reconnect-attempts=0 in the cluster-connection). This is the bit that doesn't make sense to me. How can this not be working?
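
      The behaviour described in these comments can be illustrated with a toy model. This is a simplified sketch in Python, not the Artemis implementation: the class LiveBroker, its methods, and the function backup_decision are hypothetical, and the majority rule for activation is an assumption made for illustration. The only behaviour taken from this report is that a live broker answers vote=false while the dead node's ID is still in its topology.

```python
from dataclasses import dataclass, field


@dataclass
class LiveBroker:
    """Hypothetical stand-in for a live broker that answers quorum votes."""
    name: str
    topology: set = field(default_factory=set)  # node IDs believed to be live

    def handle_vote_request(self, node_id: str) -> bool:
        # Matches the logs in this report: if the node is still found in
        # the cluster topology, the response is vote=false.
        return node_id not in self.topology

    def node_down(self, node_id: str) -> None:
        # The step that, per the comments above, never happens after kill -9.
        self.topology.discard(node_id)


def backup_decision(dead_node_id: str, voters: list) -> str:
    """Assumed rule: activate only if a majority of voters answer true."""
    votes = [voter.handle_vote_request(dead_node_id) for voter in voters]
    if sum(votes) > len(votes) / 2:
        return "activate"
    return "restart as backup"


if __name__ == "__main__":
    dead = "db48f7df-ca56-11e7-a4f0-185e0fcdacdf"
    broker2 = LiveBroker("broker 2", topology={dead, "broker-2-node-id"})

    # Repeating or delaying the vote changes nothing while the topology is stale.
    for attempt in range(3):
        print(attempt, backup_decision(dead, [broker2]))

    # Only a topology update on the live broker changes the outcome.
    broker2.node_down(dead)
    print(backup_decision(dead, [broker2]))
```

      In this model the outcome depends only on the voter's topology, which is consistent with the observation that longer delays before the vote made no difference.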

            rh-ee-ataylor Andy Taylor
            rhn-gps-mcochran Mary Cochran (Inactive)
            Michal Toth Michal Toth