-
Bug
-
Resolution: Done
-
Critical
-
AMQ 7.2.4.GA
-
None
Attaching a reproducer setup where you can see that even with "vote-on-replication-failure" enabled the isolated master doesn't shutdown as stated in Live Voting.
On my machine I have 3 master-slave pairs static cluster setup with replication (6 brokers in total, using different ports).
Masters:
- node1 (acceptor: 61616, group: red)
- node2 (acceptor: 61617, group: blue)
- node3 (acceptor: 61618, group: green)
Slaves:
- node1s (acceptor: 61619, group: red)
- node2s (acceptor: 61620, group: blue)
- node3s (acceptor: 61621, group: green)
I'm simulating a network partition by simply blocking the port using PacketFilter on MacOS (you can also use iptables on Linux):
# block port (sudo pfctl -sr 2>/dev/null; echo "block drop quick on lo0 proto tcp from any to any port = 61618") | sudo pfctl -e -f - 2>/dev/null ; nmap localhost -p 61618 # unblock port (sudo pfctl -sr 2>/dev/null | fgrep -v "block drop quick on lo0 proto tcp from any to any port = 61618") | sudo pfctl -f - ; nmap localhost -p 61618
This is what I see on the isolated master (node3 on port 61618) that has "vote-on-replication-failure=true":
2019-04-26 14:49:44,297 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: Operation timed out [code=GENERIC_EXCEPTION] 2019-04-26 14:49:44,299 INFO [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote 2019-04-26 14:49:44,301 INFO [org.apache.activemq.artemis.core.server] AMQ221067: Waiting 5 seconds for quorum vote results. 2019-04-26 14:49:44,298 WARN [org.apache.activemq.artemis.core.server] AMQ222092: Connection to the backup node failed, removing replication now: ActiveMQException[errorType=GENERIC_EXCEPTION message=Operation timed out] at org.apache.activemq.artemis.core.remoting.impl.netty.ActiveMQChannelHandler.exceptionCaught(ActiveMQChannelHandler.java:101) [artemis-core-client-2.6.3.redhat-00020.jar:2.6.3.redhat-00020] at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1401) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:953) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:125) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:174) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.6.3.redhat-00020.jar:2.6.3.redhat-00020] Caused by: java.io.IOException: Operation timed out at sun.nio.ch.FileDispatcherImpl.read0(Native Method) [rt.jar:1.8.0_181] at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) [rt.jar:1.8.0_181] at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) [rt.jar:1.8.0_181] at sun.nio.ch.IOUtil.read(IOUtil.java:192) [rt.jar:1.8.0_181] at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) [rt.jar:1.8.0_181] at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1108) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001] ... 6 more 2019-04-26 14:49:44,311 INFO [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to localhost/127.0.0.1:61617: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=true] 2019-04-26 14:49:44,311 INFO [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to localhost/127.0.0.1:61616: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=true] 2019-04-26 14:49:44,315 INFO [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from localhost/127.0.0.1:61617: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=true] 2019-04-26 14:49:44,315 INFO [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from localhost/127.0.0.1:61616: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=true] 2019-04-26 14:49:44,315 INFO [org.apache.activemq.artemis.core.server] AMQ221068: Received all quorum votes. 2019-04-26 14:49:48,786 WARN [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: AMQ229014: Did not receive data from /127.0.0.1:49158 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT] 2019-04-26 14:49:48,792 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 96e72194-6821-11e9-a8fa-88e9fe87bcb0 2019-04-26 14:49:48,793 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 96e72194-6821-11e9-a8fa-88e9fe87bcb0 2019-04-26 14:49:48,793 WARN [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 96e96b85-6821-11e9-a8fa-88e9fe87bcb0 2019-04-26 14:49:48,793 WARN [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 96e96b85-6821-11e9-a8fa-88e9fe87bcb0
This is the view from its slave becoming live:
2019-04-26 14:47:35,034 INFO [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122 is synchronized with live-server. 2019-04-26 14:47:35,072 INFO [org.apache.activemq.artemis.core.server] AMQ221031: backup announced 2019-04-26 14:47:35,123 INFO [io.hawt.web.AuthenticationFilter] Starting hawtio authentication filter, JAAS realm: "activemq" authorized role(s): "amq" role principal classes: "org.apache.activemq.artemis.spi.core.security.jaas.RolePrincipal" 2019-04-26 14:47:35,283 INFO [io.hawt.web.JolokiaConfiguredAgentServlet] Jolokia overridden property: [key=policyLocation, value=file:/Users/fvaleri/cases/02264120/test/amq-broker-7.2.4/brokers/node3s/etc/jolokia-access.xml] 2019-04-26 14:47:35,411 INFO [io.hawt.web.RBACMBeanInvoker] Using MBean [hawtio:type=security,area=jmx,rank=0,name=HawtioDummyJMXSecurity] for role based access control 2019-04-26 14:47:36,034 INFO [io.hawt.system.ProxyWhitelist] Initial proxy whitelist: [localhost, 127.0.0.1, 192.168.1.102, 10.40.204.192, ovpn-204-192.brq.redhat.com] 2019-04-26 14:47:36,907 INFO [org.apache.activemq.artemis] AMQ241001: HTTP Server started at http://localhost:8166 2019-04-26 14:47:36,908 INFO [org.apache.activemq.artemis] AMQ241002: Artemis Jolokia REST API available at http://localhost:8166/console/jolokia 2019-04-26 14:47:36,908 INFO [org.apache.activemq.artemis] AMQ241004: Artemis Console available at http://localhost:8166/console 2019-04-26 14:49:34,780 INFO [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote 2019-04-26 14:49:34,783 INFO [org.apache.activemq.artemis.core.server] AMQ221067: Waiting 30 seconds for quorum vote results. 2019-04-26 14:49:34,798 INFO [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to localhost/127.0.0.1:61616: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=false] 2019-04-26 14:49:34,799 INFO [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to localhost/127.0.0.1:61617: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=false] 2019-04-26 14:49:34,810 INFO [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from localhost/127.0.0.1:61617: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=true] 2019-04-26 14:49:34,810 INFO [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from localhost/127.0.0.1:61616: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=true] 2019-04-26 14:49:34,810 INFO [org.apache.activemq.artemis.core.server] AMQ221068: Received all quorum votes. 2019-04-26 14:49:34,812 INFO [org.apache.activemq.artemis.core.server] AMQ221071: Failing over based on quorum vote results. 2019-04-26 14:49:34,822 INFO [org.apache.activemq.artemis.core.server] AMQ221037: ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122 to become 'live' 2019-04-26 14:49:34,962 INFO [org.apache.activemq.artemis.core.server] AMQ221080: Deploying address DLQ supporting [ANYCAST] 2019-04-26 14:49:34,962 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying ANYCAST queue DLQ on address DLQ 2019-04-26 14:49:34,963 INFO [org.apache.activemq.artemis.core.server] AMQ221080: Deploying address ExpiryQueue supporting [ANYCAST] 2019-04-26 14:49:34,963 INFO [org.apache.activemq.artemis.core.server] AMQ221003: Deploying ANYCAST queue ExpiryQueue on address ExpiryQueue 2019-04-26 14:49:35,024 INFO [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live 2019-04-26 14:49:35,049 INFO [org.apache.activemq.artemis.core.server] AMQ221020: Started NIO Acceptor at 0.0.0.0:61621 for protocols [CORE,AMQP] 2019-04-26 14:49:35,138 INFO [org.apache.activemq.artemis.core.server] AMQ221027: Bridge ClusterConnectionBridge@57760fb1 [name=$.artemis.internal.sf.static-cluster.eff9e281-6111-11e9-aa7c-acde48001122, queue=QueueImpl[name=$.artemis.internal.sf.static-cluster.eff9e281-6111-11e9-aa7c-acde48001122, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122], temp=false]@4d7449df targetConnector=ServerLocatorImpl (identity=(Cluster-connection-bridge::ClusterConnectionBridge@57760fb1 [name=$.artemis.internal.sf.static-cluster.eff9e281-6111-11e9-aa7c-acde48001122, queue=QueueImpl[name=$.artemis.internal.sf.static-cluster.eff9e281-6111-11e9-aa7c-acde48001122, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122], temp=false]@4d7449df targetConnector=ServerLocatorImpl [initialConnectors=[TransportConfiguration(name=node2-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61617&host=localhost], discoveryGroupConfiguration=null]]::ClusterConnectionImpl@381493944[nodeUUID=0e9174e0-6112-11e9-8bc0-acde48001122, connector=TransportConfiguration(name=node3s-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61621&host=localhost, address=, server=ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122])) [initialConnectors=[TransportConfiguration(name=node2-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61617&host=localhost], discoveryGroupConfiguration=null]] is connected 2019-04-26 14:49:35,138 INFO [org.apache.activemq.artemis.core.server] AMQ221027: Bridge ClusterConnectionBridge@4c726392 [name=$.artemis.internal.sf.static-cluster.d89851ca-6111-11e9-acd4-acde48001122, queue=QueueImpl[name=$.artemis.internal.sf.static-cluster.d89851ca-6111-11e9-acd4-acde48001122, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122], temp=false]@493a7454 targetConnector=ServerLocatorImpl (identity=(Cluster-connection-bridge::ClusterConnectionBridge@4c726392 [name=$.artemis.internal.sf.static-cluster.d89851ca-6111-11e9-acd4-acde48001122, queue=QueueImpl[name=$.artemis.internal.sf.static-cluster.d89851ca-6111-11e9-acd4-acde48001122, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122], temp=false]@493a7454 targetConnector=ServerLocatorImpl [initialConnectors=[TransportConfiguration(name=node1-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61616&host=localhost], discoveryGroupConfiguration=null]]::ClusterConnectionImpl@381493944[nodeUUID=0e9174e0-6112-11e9-8bc0-acde48001122, connector=TransportConfiguration(name=node3s-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61621&host=localhost, address=, server=ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122])) [initialConnectors=[TransportConfiguration(name=node1-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61616&host=localhost], discoveryGroupConfiguration=null]] is connected
If I then reopen the master's port we have split-brain scenario 100% of times.
I will also attach my cluster configuration.
- is related to
-
ENTMQBR-2678 After isolated master is live again it is unable to connect to the cluster
- Closed
-
ENTMQBR-2982 Split Brain in 6-node Cluster after Network Failover
- Closed
-
AMQDOC-3108 [AMQ7, HA] add "vote-on-replication-failure=true" as recommended configuration
- Closed
- mentioned in
-
Page Loading...