JBoss A-MQ
ENTMQ-1724

Broker in fabric master/slave set up is shut down late when it loses connection to zookeeper


    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • JBoss A-MQ 6.3.x
    • JBoss A-MQ 6.3
    • broker, fabric8
    • None
    • CR2

      Note that the root container and both SSH containers must be on different machines
      (so that you can block the connection between a particular pair using iptables).

      1. Create the containers and the broker:
        fabric:create --wait-for-provisioning
        fabric:mq-create --no-ssl --parent-profile mq-base --group fabric-group --kind MasterSlave --data /mnt/nfs/fuse-shared/fabricFaframTest fabric-broker
        container-create-ssh --jvm-opts "-Xms1024M -Xmx3048M -XX:PermSize=128M -XX:MaxPermSize=512M "   --host instance.a  --user ***** --password ***** container1
        container-create-ssh --jvm-opts "-Xms1024M -Xmx3048M -XX:PermSize=128M -XX:MaxPermSize=512M "   --host instance.b  --user ***** --password ***** container2
        wait-for-provisioning
        mq-create --no-ssl --assign-container container1  --parent-profile mq-base --group fabric-group --data /mnt/nfs/fuse-shared/fabricFaframTest --kind MasterSlave fabric-broker
        mq-create --no-ssl --assign-container container2  --parent-profile mq-base --group fabric-group --data /mnt/nfs/fuse-shared/fabricFaframTest --kind MasterSlave fabric-broker
        
      2. Block the connection between the root container and the container with the master (let's say the master is on container1):
        iptables -A INPUT -s instance.a -j DROP && iptables -A OUTPUT -d instance.a -j DROP
        
      3. Watch the logs:
        • the fabric chooses a new master, which starts booting (this takes up to 30 s); to cross-check which container the registry currently lists as the master, see the sketch below
        • the old master sometimes goes down within the same 30 s window as well, but it usually takes much longer (minutes) before it realizes it is disconnected, so its broker keeps running
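
      A way to cross-check which container the fabric registry currently lists as the master is to look at the broker group from the root container's console (a sketch, assuming the standard fabric console commands are available; the exact registry path shown in the output may differ between versions):
        # run in the root container's Karaf console; fabric-group is the group used by the mq-create commands above
        fabric:cluster-list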

      If your store does not support locking properly (or locking is disabled in broker.xml), you will end up with two running brokers for a few minutes.
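
      For reference, the lock behaviour is configured on the persistence adapter in broker.xml, roughly like this (a minimal sketch, not copied from the reproducer's profile; the SharedFileLocker that shows up in the log below is the default locker for a shared KahaDB store):
        <persistenceAdapter>
          <!-- shared KahaDB store on the NFS mount used in the reproducer -->
          <!-- useLock="false" on <kahaDB> would disable locking entirely (assumption based on the standard ActiveMQ locker options) -->
          <kahaDB directory="/mnt/nfs/fuse-shared/fabricFaframTest/kahadb">
            <locker>
              <!-- how often a waiting slave re-tries the file lock, in milliseconds -->
              <shared-file-locker lockAcquireSleepInterval="10000"/>
            </locker>
          </kahaDB>
        </persistenceAdapter>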

      If the store does support locking properly, the newly elected master will hang waiting for the filesystem lock. But the fabric does not know anything about the old broker, so the broker will be unreachable for clients that use fabric discovery to determine the master container.
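
      For context, the affected clients are the ones that resolve the broker address through the fabric registry rather than a fixed URL, i.e. something like the following connection URL (a sketch; fabric-group is the group name from the mq-create commands above):
        # fabric discovery agent URL used on the client side
        discovery:(fabric:fabric-group)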

      In my case FS locking was enabled on NFSv4, and you can see the delay of roughly 2.5 minutes between the moment the new broker was elected master and the moment it actually started (once the lock held by the old broker was released):

      instance.b.log
      2016-05-26 08:34:34,054 | INFO  | Group-3-thread-1 | ActiveMQServiceFactory           | 156 - io.fabric8.mq.mq-fabric - 1.2.0.redhat-630069 | Broker fabric-broker is now the master, starting the broker.
      2016-05-26 08:34:34,055 | INFO  | Group-3-thread-1 | ActiveMQServiceFactory           | 156 - io.fabric8.mq.mq-fabric - 1.2.0.redhat-630069 | Broker fabric-broker is being started.
      2016-05-26 08:34:34,068 | INFO  | AMQ-1-thread-1   | ActiveMQServiceFactory           | 156 - io.fabric8.mq.mq-fabric - 1.2.0.redhat-630069 | booting up a broker from: profile:broker.xml
      ...
      2016-05-26 08:34:34,772 | INFO  | AMQ-1-thread-1   | BrokerService                    | 162 - org.apache.activemq.activemq-osgi - 5.11.0.redhat-630069 | Using Persistence Adapter: KahaDBPersistenceAdapter[/mnt/nfs/fuse-shared/fabricFaframTest/kahadb]
      2016-05-26 08:34:34,838 | INFO  | AMQ-1-thread-1   | SharedFileLocker                 | 162 - org.apache.activemq.activemq-osgi - 5.11.0.redhat-630069 | Database /mnt/nfs/fuse-shared/fabricFaframTest/kahadb/lock is locked by another server. This broker is now in slave mode waiting a lock to be acquired
      2016-05-26 08:37:05,404 | INFO  | AMQ-1-thread-1   | MessageDatabase                  | 162 - org.apache.activemq.activemq-osgi - 5.11.0.redhat-630069 | KahaDB is version 5
      2016-05-26 08:37:05,432 | INFO  | AMQ-1-thread-1   | MessageDatabase                  | 162 - org.apache.activemq.activemq-osgi - 5.11.0.redhat-630069 | Recovering from the journal @1:503
      2016-05-26 08:37:05,438 | INFO  | AMQ-1-thread-1   | MessageDatabase                  | 162 - org.apache.activemq.activemq-osgi - 5.11.0.redhat-630069 | Recovery replayed 53 operations from the journal in 0.025 seconds.
      ...
      2016-05-26 08:37:05,799 | INFO  | AMQ-1-thread-1   | ActiveMQServiceFactory           | 156 - io.fabric8.mq.mq-fabric - 1.2.0.redhat-630069 | Broker fabric-broker has started.
      
    • Sprint 7 - towards CR2

      I have an A-MQ master/slave broker created using fabric on different SSH containers. When the container with the master broker loses its connection to the ensemble, the broker instance is often shut down very late (it may take up to 5 minutes).

      This causes a problem because after 30 s of connection loss the ensemble considers the container unreachable, so it elects another container with a broker as the master and that container (and therefore its broker) starts.

      This may lead to two broker instances running at the same time and thus to data corruption (if the underlying filesystem used for the broker storage does not support reliable locking). But it causes an issue even when the storage locking is reliable: the newly elected master instance cannot start up because of the lock held by the disconnected instance. As a result there is no broker that clients can connect to using fabric discovery.
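
      A quick way to see the gap on the newly elected master is to compare the election and start-up messages in its log (a sketch based on the instance.b.log excerpt above, where the election happens at 08:34:34 but the broker only finishes starting at 08:37:05):
        grep -E "is now the master|has started" instance.b.log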

        1. 40s-shutdown.log
          33 kB
        2. container1-630167.log
          457 kB
        3. freeze-instance-171.tar.gz
          72 kB
        4. fuse-187.log
          63 kB
        5. fuse-187-no-retry.tar.gz
          10 kB
        6. instance.a.log
          361 kB
        7. instance.a-trace.log
          476 kB
        8. instance.b.log
          497 kB
        9. root-container.log
          301 kB
        10. thread-dump-167.txt
          116 kB
        11. threaddump-180.txt
          95 kB
        12. threaddump-187.txt
          79 kB
        13. threaddump-60.txt
          93 kB
        14. threadTimeoutTestFirstCut.diff
          9 kB

            ggrzybek Grzegorz Grzybek
            knetl.j@gmail.com Jakub Knetl (Inactive)
            Votes: 0
            Watchers: 6

              Created:
              Updated:
              Resolved: