Uploaded image for project: 'AMQ Broker'
  1. AMQ Broker
  2. ENTMQBR-2925

HA failback, intermittently, does not occur

    XMLWordPrintable

Details

    • Hide

      while true:
      check master is live, slave is standby
      kill -9 master broker, wait
      check slave is live
      start master broker, wait
      check master is live and slave is standby (failure occurs here), wait

      Failure occurs after N iterations (last test the failure occurred after ~200 iterations.

      Show
      while true: check master is live, slave is standby kill -9 master broker, wait check slave is live start master broker, wait check master is live and slave is standby (failure occurs here), wait Failure occurs after N iterations (last test the failure occurred after ~200 iterations.

    Description

      For the following configuration:

      Master:

            <ha-policy>
               <replication>
                  <master>
                     <check-for-live-server>true</check-for-live-server>
                     <group-name>prod-pair-1</group-name>
                     <vote-on-replication-failure>true</vote-on-replication-failure>
                  </master>
               </replication>
            </ha-policy>
      

      Slave:

            <ha-policy>
               <replication>
                  <slave>
                     <allow-failback>true</allow-failback>
                     <group-name>prod-pair-1</group-name>
                  </slave>
               </replication>
            </ha-policy>
      

      Failback to master intermittently fails due to unknown conditions - failure rate is approximately .5% to 33%. The master appears to start without checking for the presence of a live server, and the end result is to live servers. The following log sequence is observed*:

      Master:

      2019-09-24 22:28:12,205 INFO  [org.apache.activemq.artemis.core.server] AMQ221000: live Message Broker is starting with configuration Broker Configuration (clustered=true,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/largemessages,pagingDirectory=./data/paging)
      ...
      2019-09-24 22:28:15,769 INFO  [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
      

      All that is logged on the slave is:

      2019-09-24 22:28:15,756 WARN  [org.apache.activemq.artemis.core.client] AMQ212034: There are more than one servers on the network broadcasting the same node id. You will see this message exactly once (per node) if a node is restarted, in which case it can be safely ignored. But if it is logged continuously it means you really do have more than one node on the same network active concurrently with the same node id. This could occur if you have a backup node active at the same time as its live node. nodeID=50bb72f2-de2c-11e9-94c6-005056be7594
      

      A normal/success sequence for this master broker is:

      2019-09-24 22:25:41,756 INFO  [org.apache.activemq.artemis.core.server] AMQ221000: live Message Broker is starting with configuration Broker Configuration (clustered=true,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/largemessages,pagingDirectory=./data/paging)
      2019-09-24 22:25:45,641 INFO  [org.apache.activemq.artemis.core.server] AMQ221109: Apache ActiveMQ Artemis Backup Server version 2.7.0-PATCH-2930 [null] started, waiting live to fail before it gets active
      2019-09-24 22:25:47,134 INFO  [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::serverUUID=50bb72f2-de2c-11e9-94c6-005056be7594 is synchronized with live-server.
      2019-09-24 22:25:47,818 INFO  [org.apache.activemq.artemis.core.server] AMQ221007: Server is now live
      

      And the slave normal/success logs are:

      2019-09-24 22:25:45,834 INFO  [org.apache.activemq.artemis.core.server] AMQ221025: Replication: sending AIOSequentialFile:/mq/amq/./data/journal/activemq-data-175801.amq (size=10,485,760) to replica.
      2019-09-24 22:25:46,328 INFO  [org.apache.activemq.artemis.core.server] AMQ221025: Replication: sending AIOSequentialFile:/mq/amq/./data/journal/activemq-data-175800.amq (size=10,485,760) to replica.
      2019-09-24 22:25:46,574 INFO  [org.apache.activemq.artemis.core.server] AMQ221025: Replication: sending NIOSequentialFile /mq/amq/./data/bindings/activemq-bindings-4030.bindings (size=1,048,576) to replica.
      2019-09-24 22:25:46,592 INFO  [org.apache.activemq.artemis.core.server] AMQ221025: Replication: sending NIOSequentialFile /mq/amq/./data/bindings/activemq-bindings-4031.bindings (size=1,048,576) to replica.
      2019-09-24 22:25:46,610 INFO  [org.apache.activemq.artemis.core.server] AMQ221025: Replication: sending NIOSequentialFile /mq/amq/./data/bindings/activemq-bindings-4029.bindings (size=1,048,576) to replica.
      2019-09-24 22:25:47,162 INFO  [org.apache.activemq.artemis.core.server] AMQ221029: stopped bridge $.artemis.internal.sf.prod1-cluster.7cb5b06b-de2c-11e9-836f-005056be48f5
      2019-09-24 22:25:47,177 INFO  [org.apache.activemq.artemis.core.server] AMQ221029: stopped bridge $.artemis.internal.sf.prod1-cluster.a8997fb1-de2c-11e9-bd93-0050569a3208
      2019-09-24 22:25:47,178 INFO  [org.apache.activemq.artemis.core.server] AMQ221029: stopped bridge $.artemis.internal.sf.prod1-cluster.d4ac9c32-de2c-11e9-9463-0050569ae521
      2019-09-24 22:25:47,178 INFO  [org.apache.activemq.artemis.core.server] AMQ221029: stopped bridge $.artemis.internal.sf.prod1-cluster.00bd6ef2-de2d-11e9-80cc-005056be2edc
      2019-09-24 22:25:47,213 INFO  [org.apache.activemq.artemis.core.server] AMQ221002: Apache ActiveMQ Artemis Message Broker version 2.7.0-PATCH-2930 [50bb72f2-de2c-11e9-94c6-005056be7594] stopped, uptime 2 minutes
      2019-09-24 22:25:47,213 INFO  [org.apache.activemq.artemis.core.server] AMQ221039: Restarting as Replicating backup server after live restart
      2019-09-24 22:25:47,213 INFO  [org.apache.activemq.artemis.core.server] AMQ221000: backup Message Broker is starting with configuration Broker Configuration (clustered=true,journalDirectory=./data/journal,bindingsDirectory=./data/bindings,largeMessagesDirectory=./data/largemessages,pagingDirectory=./data/paging)
      2019-09-24 22:25:47,214 INFO  [org.apache.activemq.artemis.core.server] AMQ221055: There were too many old replicated folders upon startup, removing /mq/amq/./data/bindings/oldreplica.337
      2019-09-24 22:25:47,217 INFO  [org.apache.activemq.artemis.core.server] AMQ222162: Moving data directory /mq/amq/./data/bindings to /mq/amq/./data/bindings/oldreplica.339
      2019-09-24 22:25:47,217 INFO  [org.apache.activemq.artemis.core.server] AMQ221055: There were too many old replicated folders upon startup, removing /mq/amq/./data/journal/oldreplica.338
      2019-09-24 22:25:47,223 INFO  [org.apache.activemq.artemis.core.server] AMQ222162: Moving data directory /mq/amq/./data/journal to /mq/amq/./data/journal/oldreplica.340
      2019-09-24 22:25:47,223 INFO  [org.apache.activemq.artemis.core.server] AMQ221055: There were too many old replicated folders upon startup, removing /mq/amq/./data/paging/oldreplica.337
      2019-09-24 22:25:47,224 INFO  [org.apache.activemq.artemis.core.server] AMQ222162: Moving data directory /mq/amq/./data/paging to /mq/amq/./data/paging/oldreplica.339
      2019-09-24 22:25:47,225 INFO  [org.apache.activemq.artemis.core.server] AMQ221012: Using AIO Journal
      2019-09-24 22:25:48,360 INFO  [org.apache.activemq.artemis.core.server] AMQ221109: Apache ActiveMQ Artemis Backup Server version 2.7.0-PATCH-2930 [null] started, waiting live to fail before it gets active
      2019-09-24 22:25:49,912 INFO  [org.apache.activemq.artemis.core.server] AMQ221024: Backup server ActiveMQServerImpl::serverUUID=50bb72f2-de2c-11e9-94c6-005056be7594 is synchronized with live-server.
      2019-09-24 22:25:51,819 INFO  [org.apache.activemq.artemis.core.server] AMQ221031: backup announced
      
      • Note logs lines were cherry picked from the full logs, but the full logs are available.

      Attachments

        Issue Links

          Activity

            People

              fnigro Francesco Nigro
              rhn-support-shiggs Stephen Higgs
              Oleg Sushchenko Oleg Sushchenko
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: