Uploaded image for project: 'AMQ Broker'
  1. AMQ Broker
  2. ENTMQBR-2476

Live server does not shutdown when using vote-on-replication-failure

    XMLWordPrintable

Details

    Description

      Attaching a reproducer setup where you can see that even with "vote-on-replication-failure" enabled the isolated master doesn't shutdown as stated in Live Voting.

      On my machine I have 3 master-slave pairs static cluster setup with replication (6 brokers in total, using different ports).

      Masters:

      • node1 (acceptor: 61616, group: red)
      • node2 (acceptor: 61617, group: blue)
      • node3 (acceptor: 61618, group: green)

      Slaves:

      • node1s (acceptor: 61619, group: red)
      • node2s (acceptor: 61620, group: blue)
      • node3s (acceptor: 61621, group: green)

      I'm simulating a network partition by simply blocking the port using PacketFilter on MacOS (you can also use iptables on Linux):

      # block port
      (sudo pfctl -sr 2>/dev/null; echo "block drop quick on lo0 proto tcp from any to any port = 61618") | sudo pfctl -e -f - 2>/dev/null ; nmap localhost -p 61618
      # unblock port 
      (sudo pfctl -sr 2>/dev/null | fgrep -v "block drop quick on lo0 proto tcp from any to any port = 61618") | sudo pfctl -f - ; nmap localhost -p 61618
      

      This is what I see on the isolated master (node3 on port 61618) that has "vote-on-replication-failure=true":

      2019-04-26 14:49:44,297 WARN  [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: Operation timed out [code=GENERIC_EXCEPTION]
      2019-04-26 14:49:44,299 INFO  [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote
      2019-04-26 14:49:44,301 INFO  [org.apache.activemq.artemis.core.server] AMQ221067: Waiting 5 seconds for quorum vote results.
      2019-04-26 14:49:44,298 WARN  [org.apache.activemq.artemis.core.server] AMQ222092: Connection to the backup node failed, removing replication now: ActiveMQException[errorType=GENERIC_EXCEPTION message=Operation timed out]
      	at org.apache.activemq.artemis.core.remoting.impl.netty.ActiveMQChannelHandler.exceptionCaught(ActiveMQChannelHandler.java:101) [artemis-core-client-2.6.3.redhat-00020.jar:2.6.3.redhat-00020]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:256) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.DefaultChannelPipeline$HeadContext.exceptionCaught(DefaultChannelPipeline.java:1401) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:264) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:953) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:125) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:174) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:646) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:581) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:460) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.6.3.redhat-00020.jar:2.6.3.redhat-00020]
      Caused by: java.io.IOException: Operation timed out
      	at sun.nio.ch.FileDispatcherImpl.read0(Native Method) [rt.jar:1.8.0_181]
      	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) [rt.jar:1.8.0_181]
      	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) [rt.jar:1.8.0_181]
      	at sun.nio.ch.IOUtil.read(IOUtil.java:192) [rt.jar:1.8.0_181]
      	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) [rt.jar:1.8.0_181]
      	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:288) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:1108) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:347) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:148) [netty-all-4.1.28.Final-redhat-00001.jar:4.1.28.Final-redhat-00001]
      	... 6 more
      
      2019-04-26 14:49:44,311 INFO  [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to localhost/127.0.0.1:61617: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=true]
      2019-04-26 14:49:44,311 INFO  [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to localhost/127.0.0.1:61616: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=true]
      2019-04-26 14:49:44,315 INFO  [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from localhost/127.0.0.1:61617: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=true]
      2019-04-26 14:49:44,315 INFO  [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from localhost/127.0.0.1:61616: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=true]
      2019-04-26 14:49:44,315 INFO  [org.apache.activemq.artemis.core.server] AMQ221068: Received all quorum votes.
      2019-04-26 14:49:48,786 WARN  [org.apache.activemq.artemis.core.client] AMQ212037: Connection failure has been detected: AMQ229014: Did not receive data from /127.0.0.1:49158 within the 60,000ms connection TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
      2019-04-26 14:49:48,792 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 96e72194-6821-11e9-a8fa-88e9fe87bcb0
      2019-04-26 14:49:48,793 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 96e72194-6821-11e9-a8fa-88e9fe87bcb0
      2019-04-26 14:49:48,793 WARN  [org.apache.activemq.artemis.core.server] AMQ222061: Client connection failed, clearing up resources for session 96e96b85-6821-11e9-a8fa-88e9fe87bcb0
      2019-04-26 14:49:48,793 WARN  [org.apache.activemq.artemis.core.server] AMQ222107: Cleared up resources for session 96e96b85-6821-11e9-a8fa-88e9fe87bcb0
      

      This is the view from its slave becoming live:

      2019-04-26 14:47:35,034 INFO  [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122 is synchronized with live-server.
      2019-04-26 14:47:35,072 INFO  [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
      2019-04-26 14:47:35,123 INFO  [io.hawt.web.AuthenticationFilter] Starting hawtio authentication filter, JAAS realm: "activemq" authorized role(s): "amq" role principal classes: "org.apache.activemq.artemis.spi.core.security.jaas.RolePrincipal"
      2019-04-26 14:47:35,283 INFO  [io.hawt.web.JolokiaConfiguredAgentServlet] Jolokia overridden property: [key=policyLocation, value=file:/Users/fvaleri/cases/02264120/test/amq-broker-7.2.4/brokers/node3s/etc/jolokia-access.xml]
      2019-04-26 14:47:35,411 INFO  [io.hawt.web.RBACMBeanInvoker] Using MBean [hawtio:type=security,area=jmx,rank=0,name=HawtioDummyJMXSecurity] for role based access control
      2019-04-26 14:47:36,034 INFO  [io.hawt.system.ProxyWhitelist] Initial proxy whitelist: [localhost, 127.0.0.1, 192.168.1.102, 10.40.204.192, ovpn-204-192.brq.redhat.com]
      2019-04-26 14:47:36,907 INFO  [org.apache.activemq.artemis] AMQ241001: HTTP Server started at http://localhost:8166
      2019-04-26 14:47:36,908 INFO  [org.apache.activemq.artemis] AMQ241002: Artemis Jolokia REST API available at http://localhost:8166/console/jolokia
      2019-04-26 14:47:36,908 INFO  [org.apache.activemq.artemis] AMQ241004: Artemis Console available at http://localhost:8166/console
      2019-04-26 14:49:34,780 INFO  [org.apache.activemq.artemis.core.server] AMQ221066: Initiating quorum vote: LiveFailoverQuorumVote
      2019-04-26 14:49:34,783 INFO  [org.apache.activemq.artemis.core.server] AMQ221067: Waiting 30 seconds for quorum vote results.
      2019-04-26 14:49:34,798 INFO  [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to localhost/127.0.0.1:61616: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=false]
      2019-04-26 14:49:34,799 INFO  [org.apache.activemq.artemis.core.server] AMQ221060: Sending quorum vote request to localhost/127.0.0.1:61617: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=false]
      2019-04-26 14:49:34,810 INFO  [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from localhost/127.0.0.1:61617: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=true]
      2019-04-26 14:49:34,810 INFO  [org.apache.activemq.artemis.core.server] AMQ221061: Received quorum vote response from localhost/127.0.0.1:61616: ServerConnectVote [nodeId=0e9174e0-6112-11e9-8bc0-acde48001122, vote=true]
      2019-04-26 14:49:34,810 INFO  [org.apache.activemq.artemis.core.server] AMQ221068: Received all quorum votes.
      2019-04-26 14:49:34,812 INFO  [org.apache.activemq.artemis.core.server] AMQ221071: Failing over based on quorum vote results.
      2019-04-26 14:49:34,822 INFO  [org.apache.activemq.artemis.core.server] AMQ221037: ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122 to become 'live'
      2019-04-26 14:49:34,962 INFO  [org.apache.activemq.artemis.core.server] AMQ221080: Deploying address DLQ supporting [ANYCAST]
      2019-04-26 14:49:34,962 INFO  [org.apache.activemq.artemis.core.server] AMQ221003: Deploying ANYCAST queue DLQ on address DLQ
      2019-04-26 14:49:34,963 INFO  [org.apache.activemq.artemis.core.server] AMQ221080: Deploying address ExpiryQueue supporting [ANYCAST]
      2019-04-26 14:49:34,963 INFO  [org.apache.activemq.artemis.core.server] AMQ221003: Deploying ANYCAST queue ExpiryQueue on address ExpiryQueue
      2019-04-26 14:49:35,024 INFO  [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
      2019-04-26 14:49:35,049 INFO  [org.apache.activemq.artemis.core.server] AMQ221020: Started NIO Acceptor at 0.0.0.0:61621 for protocols [CORE,AMQP]
      2019-04-26 14:49:35,138 INFO  [org.apache.activemq.artemis.core.server] AMQ221027: Bridge ClusterConnectionBridge@57760fb1 [name=$.artemis.internal.sf.static-cluster.eff9e281-6111-11e9-aa7c-acde48001122, queue=QueueImpl[name=$.artemis.internal.sf.static-cluster.eff9e281-6111-11e9-aa7c-acde48001122, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122], temp=false]@4d7449df targetConnector=ServerLocatorImpl (identity=(Cluster-connection-bridge::ClusterConnectionBridge@57760fb1 [name=$.artemis.internal.sf.static-cluster.eff9e281-6111-11e9-aa7c-acde48001122, queue=QueueImpl[name=$.artemis.internal.sf.static-cluster.eff9e281-6111-11e9-aa7c-acde48001122, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122], temp=false]@4d7449df targetConnector=ServerLocatorImpl [initialConnectors=[TransportConfiguration(name=node2-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61617&host=localhost], discoveryGroupConfiguration=null]]::ClusterConnectionImpl@381493944[nodeUUID=0e9174e0-6112-11e9-8bc0-acde48001122, connector=TransportConfiguration(name=node3s-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61621&host=localhost, address=, server=ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122])) [initialConnectors=[TransportConfiguration(name=node2-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61617&host=localhost], discoveryGroupConfiguration=null]] is connected
      2019-04-26 14:49:35,138 INFO  [org.apache.activemq.artemis.core.server] AMQ221027: Bridge ClusterConnectionBridge@4c726392 [name=$.artemis.internal.sf.static-cluster.d89851ca-6111-11e9-acd4-acde48001122, queue=QueueImpl[name=$.artemis.internal.sf.static-cluster.d89851ca-6111-11e9-acd4-acde48001122, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122], temp=false]@493a7454 targetConnector=ServerLocatorImpl (identity=(Cluster-connection-bridge::ClusterConnectionBridge@4c726392 [name=$.artemis.internal.sf.static-cluster.d89851ca-6111-11e9-acd4-acde48001122, queue=QueueImpl[name=$.artemis.internal.sf.static-cluster.d89851ca-6111-11e9-acd4-acde48001122, postOffice=PostOfficeImpl [server=ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122], temp=false]@493a7454 targetConnector=ServerLocatorImpl [initialConnectors=[TransportConfiguration(name=node1-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61616&host=localhost], discoveryGroupConfiguration=null]]::ClusterConnectionImpl@381493944[nodeUUID=0e9174e0-6112-11e9-8bc0-acde48001122, connector=TransportConfiguration(name=node3s-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61621&host=localhost, address=, server=ActiveMQServerImpl::serverUUID=0e9174e0-6112-11e9-8bc0-acde48001122])) [initialConnectors=[TransportConfiguration(name=node1-connector, factory=org-apache-activemq-artemis-core-remoting-impl-netty-NettyConnectorFactory) ?port=61616&host=localhost], discoveryGroupConfiguration=null]] is connected
      

      If I then reopen the master's port we have split-brain scenario 100% of times.
      I will also attach my cluster configuration.

      Attachments

        Issue Links

          Activity

            People

              ataylor@redhat.com Andy Taylor
              rhn-support-fvaleri Federico Valeri
              Michal Toth Michal Toth
              Votes:
              1 Vote for this issue
              Watchers:
              13 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: