Red Hat Fuse / ENTESB-6866

Fabric MQ gateway is left in a "no endpoints available" condition after ZK sub-quorum situation



      Set up a Fabric8 topology like the following container-list output. It is important that ssh2 and b2 are on a different host, or at least in some way capable of being disconnected at the network level, from everything else. So on one host we have two SSH containers (one down), a broker child container, and the gateway; on the other host we have an SSH container and a broker child container. The three SSH containers form a ZK ensemble, one node of which is down, so at present it is (just) above quorum.

      [id]     [version]  [type]  [connected]  [profiles]              [provision status]
      gateway  1.0        karaf   yes          default                 success           
                                               gateway-mq                                
      root*    1.0        karaf   yes          fabric                  success           
                                               fabric-ensemble-0001-1                    
        b1     1.0        karaf   yes          default                 success           
                                               mq-broker-default.test                    
      ssh1     1.0        karaf   yes          default                 success           
                                               fabric-ensemble-0001-2                    
        b2     1.0        karaf   yes          default                 success           
                                               mq-broker-default.test                    
      ssh2     1.0        karaf   no           default                 success           
                                               fabric-ensemble-0001-3         
      

      The default jboss-fuse-full profile must be removed from "root", else it will provide a broker of its own, which will interfere with the test.

      Using any JMS client, ensure that a connection can be made to the gateway on port 61616.
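
      For reference, here is a minimal sketch of such a check using the plain ActiveMQ JMS client (any JMS client will do); gateway.example.com is a placeholder for whatever host the gateway container is running on:

      import javax.jms.Connection;
      import javax.jms.JMSException;
      import org.apache.activemq.ActiveMQConnectionFactory;

      public class GatewayCheck {

          // Returns true if a JMS connection to the given broker URL can be established.
          static boolean canConnect(String brokerUrl) {
              ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(brokerUrl);
              Connection connection = null;
              try {
                  connection = factory.createConnection();
                  connection.start();   // make sure the transport is actually connected
                  return true;
              } catch (JMSException e) {
                  System.out.println("Connection failed: " + e.getMessage());
                  return false;
              } finally {
                  if (connection != null) {
                      try { connection.close(); } catch (JMSException ignored) { }
                  }
              }
          }

          public static void main(String[] args) {
              // port 61616 is the gateway's MQ port from the reproduction steps
              System.out.println("Connected: " + canConnect("tcp://gateway.example.com:61616"));
          }
      }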

      Ensure (e.g., by looking in the logs) that b2 is the master broker. If it is not, shut down b1, then restart it, so it comes up as slave.

      Now disconnect ssh2 and b2 from the rest of the topology, either physically or using iptables. Note that a clean shutdown is not sufficient to reproduce the problem (I think) – we need network timeouts.

      This will put the fabric into a sub-quorum state. In this state we should expect odd things to happen, and they do. Leave it like this for several minutes – until ZooKeeper "Client has been stopped" messages are seen.

      Verify that a connection can no longer be made to the gateway. Although we have a running broker that is reachable from the gateway, it does not seem to take over the master role – presumably because the cluster is sub-quorum. In itself, I don't think this is a bug.
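
      If you are using something like the canConnect() sketch above, it should now return false; the exact exception seen on the client side will depend on how the gateway rejects or drops the connection attempt.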

      Now reconnect ssh2 and b2. It will take a little while for the cluster to re-form, and for fabric:xxx commands to start working again.

      Note that it is still not possible to make a connection to the gateway. We see "No endpoints available" messages in the logs. Although brokers are available, the gateway will not route messages to either of them.
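
      To watch for recovery (or the lack of it) over time, a simple polling loop around the hypothetical canConnect() helper from the sketch above can be used; again this is only a sketch, with gateway.example.com standing in for the real gateway host:

      public class GatewayWatch {

          public static void main(String[] args) throws InterruptedException {
              String url = "tcp://gateway.example.com:61616";   // placeholder gateway host
              while (true) {
                  boolean up = GatewayCheck.canConnect(url);
                  System.out.println(System.currentTimeMillis() + " gateway reachable: " + up);
                  Thread.sleep(10000);   // poll every 10 seconds
              }
          }
      }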

      Restarting the gateway recovers service.


      This problem seems to be related to ENTESB-6254; however, that bug was verified as fixed in 6.2.1 R7 (I checked it myself), whilst this current problem is reproducible in R7. So either ENTESB-6254 was not fully fixed, or we have discovered a new way to elicit a very similar-looking failure.

      The problem I can reproduce is one in which the Fabric8 MQ gateway does not realize that there are brokers available after a network outage, even though one of the brokers is master and the other slave, and both are reachable on the network. It seems that the outage has to be sufficient to bring about a sub-quorum state in the ZK ensemble, so that neither broker is master for some time. However, it's possible that other ZK events may have a similar effect.

      In fact, I rather suspect that there are similar, but distinct, modes of failure, depending on exactly where in the topology the outage occurs, and what ZK roles each node is playing at the time. I have logs from the customer showing a situation where both brokers get stuck as slaves after the outage is resolved – but I have not so far been able to reproduce that.

      In all cases the practical impact is that after an outage, the gateway/broker system does not recover, and manual action always seems to be necessary to restore service.

            pantinor@redhat.com Paolo Antinori
            rhn-support-kboone Kevin Boone
