  1. AMQ Broker
  2. ENTMQBR-2982

Split Brain in 6-node Cluster after Network Failover



      I have been able to approximate the behavior described by creating 3 virtual hosts:

      host1 with master1 and master3 brokers
      host2 with master2 broker
      host3 with slave1, slave2, and slave3 brokers

      If I start all brokers and allow them to form cluster connections and finish initial replication, then start iptables on host2 with the following ruleset:

      # Firewall configuration written by system-config-firewall
      # Manual customization of this file is not recommended.
      *filter
      :INPUT ACCEPT [0:0]
      :FORWARD ACCEPT [0:0]
      :OUTPUT ACCEPT [0:0]
      -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
      -A INPUT -p icmp -j ACCEPT
      -A INPUT -i lo -j ACCEPT
      -A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
      -A INPUT -j REJECT --reject-with icmp-host-prohibited
      -A FORWARD -j REJECT --reject-with icmp-host-prohibited
      COMMIT
      

      and I then wait until I see a quorum vote request from the slave2 broker before stopping iptables, I can force a split-brain every time.

      Restoring connectivity does not result in a failback, and I can see that the master2 and slave2 brokers both remain live even though there is an intact connection between the two.

      Unfortunately, this does not seem to mirror the use case in the description exactly: I see a quorum vote in the master2 logs that is not present in the described use case.
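
To make the effect of the ruleset above easier to reason about, here is a minimal Python sketch of how its INPUT chain is evaluated: rules are checked in order and the first match decides. The `Packet` type, the `input_verdict` function, the interface name `eth0`, and the broker port 61616 are illustrative assumptions, not taken from the report. The sketch does not model how connection tracking classifies connections that already existed when iptables was started.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    state: str      # conntrack state: "NEW", "ESTABLISHED" or "RELATED"
    proto: str      # "tcp", "udp" or "icmp"
    in_iface: str   # receiving interface, e.g. "eth0" or "lo" (hypothetical names)
    dport: int = 0  # destination port (unused for icmp)


def input_verdict(pkt: Packet) -> str:
    """Walk the INPUT rules in order; the first matching rule decides."""
    # -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    if pkt.state in ("ESTABLISHED", "RELATED"):
        return "ACCEPT"
    # -A INPUT -p icmp -j ACCEPT
    if pkt.proto == "icmp":
        return "ACCEPT"
    # -A INPUT -i lo -j ACCEPT
    if pkt.in_iface == "lo":
        return "ACCEPT"
    # -A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT
    if pkt.state == "NEW" and pkt.proto == "tcp" and pkt.dport == 22:
        return "ACCEPT"
    # -A INPUT -j REJECT --reject-with icmp-host-prohibited
    return "REJECT"


BROKER_PORT = 61616  # assumption: the broker's acceptor port, not stated in the report

# A new inbound connection to the broker on host2 is rejected ...
print(input_verdict(Packet("NEW", "tcp", "eth0", BROKER_PORT)))
# ... while traffic on a connection already tracked as established is accepted.
print(input_verdict(Packet("ESTABLISHED", "tcp", "eth0", BROKER_PORT)))
```

Under these assumptions, only new inbound connections to host2 (other than SSH) are rejected while the ruleset is active.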


      In a virtualized environment with network failover capability (redundant network cards with failover bonding or similar, such that physically or electrically disconnecting the primary network interface results in a failover to the secondary interface with the same IP address pool), network failover sometimes results in a split-brain in which the slave broker goes live and stays live even though connectivity to the replication master is not lost. The behavior is similar to that described in ENTMQBR-2377 and ENTMQBR-2476, with the important caveat that we never see a quorum vote initiated by the master involved in the split, only by the slave.

      Consider the following architecture:

      host1         host2         host3
      ====          ====          ==== 
      master1       master2       slave1
      master3                     slave2
                                  slave3
      

      In the above cluster, each physical host has 2 network interfaces, configured for failover. Each guest (AMQ VM) has only 1 virtual NIC, bridged to the physical failover interface.

      Consider a scenario in which the switch port connected to the primary NIC of host2 (hosting only the master2 broker) is disabled, forcing failover to the secondary NIC. Note that IP / MAC addresses in the virtual network do not change in this scenario; the only effect felt at the virtual network level should be a momentary isolation. In this case we see the slave2 broker initiate a quorum vote and become the live broker, based on responses from master1 and master3. The master2 broker does not detect any failure between itself and slave2, does not initiate a quorum vote, and remains live. After the failover / split-brain occurs, we can see via netstat that the connection between the master2 and slave2 brokers is intact, with the same client and server ports as before the failover. No RST or RETRANS events are observed between the slave and master involved in the split.
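
The outcome described above can be illustrated with a deliberately simplified Python model. It is not the broker's actual voting implementation: the function names and the strict-majority threshold are assumptions made for illustration. The model only captures the two observations from the report: the voters (master1 and master3) report master2 as unreachable during the momentary isolation, and master2 itself never detects a failure.

```python
from typing import Dict, Set


def backup_wins_vote(voter_sees_live: Dict[str, bool]) -> bool:
    """Return True when a strict majority of voters cannot see the live broker.

    Assumption: a strict-majority rule; the real broker's threshold may differ.
    """
    down_votes = sum(1 for sees in voter_sees_live.values() if not sees)
    return down_votes > len(voter_sees_live) // 2


def live_brokers_after_blip(voter_sees_live: Dict[str, bool],
                            master_detected_failure: bool) -> Set[str]:
    """Model which members of the master2/slave2 pair end up live."""
    live = set()
    # master2 only steps down if it noticed a problem itself.
    if not master_detected_failure:
        live.add("master2")
    # slave2 activates if the vote it initiated goes in its favor.
    if backup_wins_vote(voter_sees_live):
        live.add("slave2")
    return live


# During the isolation of host2, neither master1 nor master3 can see master2,
# while master2 observes no failure on its connection to slave2.
votes = {"master1": False, "master3": False}
print(sorted(live_brokers_after_blip(votes, master_detected_failure=False)))
```

In this model both brokers of the pair end up live, which corresponds to the split-brain reported here; if master2 had detected the failure, or if the voters had still seen master2, only one of the pair would remain live.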

              rh-ee-ataylor Andy Taylor
              rhn-support-dhawkins Duane Hawkins