JBoss A-MQ
ENTMQ-1724

Broker in fabric master/slave set up is shut down late when it loses connection to zookeeper


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • JBoss A-MQ 6.3.x
    • JBoss A-MQ 6.3
    • broker, fabric8
    • None
    • CR2

      Note that the root container and both SSH containers must be on different machines
      (so that you can block the connection between a particular pair using iptables).

      1. Create the containers and the broker:
        fabric:create --wait-for-provisioning
        fabric:mq-create --no-ssl --parent-profile mq-base --group fabric-group --kind MasterSlave --data /mnt/nfs/fuse-shared/fabricFaframTest fabric-broker
        container-create-ssh --jvm-opts "-Xms1024M -Xmx3048M -XX:PermSize=128M -XX:MaxPermSize=512M "   --host instance.a  --user ***** --password ***** container1
        container-create-ssh --jvm-opts "-Xms1024M -Xmx3048M -XX:PermSize=128M -XX:MaxPermSize=512M "   --host instance.b  --user ***** --password ***** container2
        wait-for-provisioning
        mq-create --no-ssl --assign-container container1  --parent-profile mq-base --group fabric-group --data /mnt/nfs/fuse-shared/fabricFaframTest --kind MasterSlave fabric-broker
        mq-create --no-ssl --assign-container container2  --parent-profile mq-base --group fabric-group --data /mnt/nfs/fuse-shared/fabricFaframTest --kind MasterSlave fabric-broker
        
      2. Block the connection between the root container and the container with the master (let's say the master is on container1):
        iptables -A INPUT -s instance.a -j DROP && iptables -A OUTPUT -d instance.a -j DROP
        
      3. Watch the logs:
        • the fabric chooses a new master, which starts booting (this takes up to 30 s); to cross-check which container the registry currently lists as the master, see the sketch below
        • the old master sometimes goes down within the same 30 s window as well, but it usually takes much longer (minutes) before it realizes it is disconnected, so its broker keeps running
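
      A way to cross-check which container the fabric registry currently lists as the master is to look at the broker group from the root container's console (a sketch, assuming the standard fabric console commands are available; the exact registry path shown in the output may differ between versions):
        # run in the root container's Karaf console; fabric-group is the group used by the mq-create commands above
        fabric:cluster-list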

      If your store does not support locking properly (or locking is disabled in broker.xml), you will end up with two running brokers for a few minutes.
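
      For reference, the lock behaviour is configured on the persistence adapter in broker.xml, roughly like this (a minimal sketch, not copied from the reproducer's profile; the SharedFileLocker that shows up in the log below is the default locker for a shared KahaDB store):
        <persistenceAdapter>
          <!-- shared KahaDB store on the NFS mount used in the reproducer -->
          <!-- useLock="false" on <kahaDB> would disable locking entirely (assumption based on the standard ActiveMQ locker options) -->
          <kahaDB directory="/mnt/nfs/fuse-shared/fabricFaframTest/kahadb">
            <locker>
              <!-- how often a waiting slave re-tries the file lock, in milliseconds -->
              <shared-file-locker lockAcquireSleepInterval="10000"/>
            </locker>
          </kahaDB>
        </persistenceAdapter>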

      If the store does support locking properly, the newly elected master will hang waiting for the filesystem lock. But the fabric does not know anything about the old broker, so the broker will be unreachable for clients that use fabric discovery to determine the master container.
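
      For context, the affected clients are the ones that resolve the broker address through the fabric registry rather than a fixed URL, i.e. something like the following connection URL (a sketch; fabric-group is the group name from the mq-create commands above):
        # fabric discovery agent URL used on the client side
        discovery:(fabric:fabric-group)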

      In my case FS locking was enabled on NFSv4, and you can see the delay of roughly 2.5 minutes between the moment the new broker was elected master and the moment it actually started (once the lock held by the old broker was released):

      instance.b.log
      2016-05-26 08:34:34,054 | INFO  | Group-3-thread-1 | ActiveMQServiceFactory           | 156 - io.fabric8.mq.mq-fabric - 1.2.0.redhat-630069 | Broker fabric-broker is now the master, starting the broker.
      2016-05-26 08:34:34,055 | INFO  | Group-3-thread-1 | ActiveMQServiceFactory           | 156 - io.fabric8.mq.mq-fabric - 1.2.0.redhat-630069 | Broker fabric-broker is being started.
      2016-05-26 08:34:34,068 | INFO  | AMQ-1-thread-1   | ActiveMQServiceFactory           | 156 - io.fabric8.mq.mq-fabric - 1.2.0.redhat-630069 | booting up a broker from: profile:broker.xml
      ...
      2016-05-26 08:34:34,772 | INFO  | AMQ-1-thread-1   | BrokerService                    | 162 - org.apache.activemq.activemq-osgi - 5.11.0.redhat-630069 | Using Persistence Adapter: KahaDBPersistenceAdapter[/mnt/nfs/fuse-shared/fabricFaframTest/kahadb]
      2016-05-26 08:34:34,838 | INFO  | AMQ-1-thread-1   | SharedFileLocker                 | 162 - org.apache.activemq.activemq-osgi - 5.11.0.redhat-630069 | Database /mnt/nfs/fuse-shared/fabricFaframTest/kahadb/lock is locked by another server. This broker is now in slave mode waiting a lock to be acquired
      2016-05-26 08:37:05,404 | INFO  | AMQ-1-thread-1   | MessageDatabase                  | 162 - org.apache.activemq.activemq-osgi - 5.11.0.redhat-630069 | KahaDB is version 5
      2016-05-26 08:37:05,432 | INFO  | AMQ-1-thread-1   | MessageDatabase                  | 162 - org.apache.activemq.activemq-osgi - 5.11.0.redhat-630069 | Recovering from the journal @1:503
      2016-05-26 08:37:05,438 | INFO  | AMQ-1-thread-1   | MessageDatabase                  | 162 - org.apache.activemq.activemq-osgi - 5.11.0.redhat-630069 | Recovery replayed 53 operations from the journal in 0.025 seconds.
      ...
      2016-05-26 08:37:05,799 | INFO  | AMQ-1-thread-1   | ActiveMQServiceFactory           | 156 - io.fabric8.mq.mq-fabric - 1.2.0.redhat-630069 | Broker fabric-broker has started.
      
    • Sprint 7 - towards CR2

      I have an A-MQ master/slave broker created using fabric on different SSH containers. When the container with the master broker loses its connection to the ensemble, the broker instance is often shut down very late (it may take up to 5 minutes).

      This causes a problem because after 30 s of connection loss the ensemble considers the container unreachable, so it elects another container with a broker as the master and that container (and therefore its broker) starts.

      This may lead to two broker instances running at the same time and thus to data corruption (if the underlying filesystem used for the broker storage does not support reliable locking). But it causes an issue even when the storage locking is reliable: the newly elected master instance cannot start up because of the lock held by the disconnected instance. As a result there is no broker that clients can connect to using fabric discovery.
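
      A quick way to see the gap on the newly elected master is to compare the election and start-up messages in its log (a sketch based on the instance.b.log excerpt above, where the election happens at 08:34:34 but the broker only finishes starting at 08:37:05):
        grep -E "is now the master|has started" instance.b.log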

        1. 40s-shutdown.log
          33 kB
        2. container1-630167.log
          457 kB
        3. freeze-instance-171.tar.gz
          72 kB
        4. fuse-187.log
          63 kB
        5. fuse-187-no-retry.tar.gz
          10 kB
        6. instance.a.log
          361 kB
        7. instance.a-trace.log
          476 kB
        8. instance.b.log
          497 kB
        9. root-container.log
          301 kB
        10. thread-dump-167.txt
          116 kB
        11. threaddump-180.txt
          95 kB
        12. threaddump-187.txt
          79 kB
        13. threaddump-60.txt
          93 kB
        14. threadTimeoutTestFirstCut.diff
          9 kB

            ggrzybek Grzegorz Grzybek
            knetl.j@gmail.com Jakub Knetl (Inactive)
            Votes: 0
            Watchers: 6

              Created:
              Updated:
              Resolved: