-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
AMQ 7.4.1
In a virtualized environment with network failover capability (redundant network cards with failover bonding or similar, such that physically or electrically disconnecting the primary network interface causes a failover to the secondary interface with the same IP address pool), a network failover sometimes results in a split brain: the slave broker goes live and stays live even though connectivity to the replication master is not lost. The behavior is similar to that described in ENTMQBR-2377 and ENTMQBR-2476, with the important caveat that the master involved in the split never initiates a quorum vote; only the slave does.
Consider the following architecture:
host1: master1, master3
host2: master2
host3: slave1, slave2, slave3
In the above cluster, each physical host has two network interfaces configured for failover. Each guest (AMQ VM) has only one virtual NIC, bridged to the physical failover interface.
In a scenario where the switch port connected to the primary NIC of host2 (hosting only the master2 broker) is disabled, forcing failover to the secondary NIC, we see the slave2 broker initiate a quorum vote and become the live broker, based on responses from master1 and master3. Note that IP and MAC addresses in the virtual network do not change in this scenario; the only effect felt at the virtual network level should be a momentary isolation. The master2 broker does not detect any failure between itself and slave2, does not initiate a quorum vote, and remains live. After the failover and split brain occur, netstat shows that the connection between the master2 and slave2 brokers is intact, with the same client and server ports as before the failover. No RST or RETRANS events are observed between the slave and master involved in the split.
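For anyone trying to reproduce this, the netstat observation can be checked mechanically. The following is a minimal sketch, not part of the original report: it assumes Linux `netstat -tn` style output (columns Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State), captured on one broker before and after the failover. The function names and the example addresses are hypothetical.

```python
from typing import NamedTuple, Set


class Conn(NamedTuple):
    local: str   # "ip:port" of the local endpoint
    remote: str  # "ip:port" of the remote endpoint
    state: str   # TCP state as printed by netstat


def parse_netstat(text: str) -> Set[Conn]:
    """Parse `netstat -tn` style lines; header and non-TCP lines are skipped."""
    conns = set()
    for line in text.splitlines():
        fields = line.split()
        # Assumed column order: Proto Recv-Q Send-Q Local Foreign State
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            conns.add(Conn(fields[3], fields[4], fields[5]))
    return conns


def replication_link_intact(before: str, after: str, peer_ip: str) -> bool:
    """Return True if at least one ESTABLISHED connection to peer_ip appears
    in both snapshots with identical local and remote ports."""
    def established(text: str) -> Set[Conn]:
        return {
            c for c in parse_netstat(text)
            if c.state == "ESTABLISHED"
            and c.remote.rsplit(":", 1)[0] == peer_ip
        }
    return bool(established(before) & established(after))


if __name__ == "__main__":
    # Hypothetical snapshot taken on master2; 192.0.2.22 stands in for slave2.
    snapshot = "tcp 0 0 192.0.2.12:61616 192.0.2.22:45678 ESTABLISHED\n"
    print(replication_link_intact(snapshot, snapshot, "192.0.2.22"))
```

A result of True after the slave has gone live would match the symptom described above: the TCP connection between the pair survived the failover unchanged, yet the slave still activated.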
- relates to
-
ENTMQBR-2476 Live server does not shutdown when using vote-on-replication-failure
- Closed
-
ENTMQBR-2377 Slave, which was activated by split brain, does not deactivate even after network recovery to master
- Closed