Red Hat Fuse / ENTESB-6866

Fabric MQ gateway is left in a "no endpoints available" condition after ZK sub-quorum situation



      Set up a Fabric8 topology like the following container-list output. It is important that ssh2 and b2 are on a different host, or at least in some way capable of being disconnected at the network level, from everything else. So on one host we have two SSH containers (one down), a broker child container, and the gateway; on the other host we have an SSH container and a broker child container. The three SSH containers form a ZK ensemble, one node of which is down, so at present it is (just) above quorum.

      [id]     [version]  [type]  [connected]  [profiles]              [provision status]
      gateway  1.0        karaf   yes          default                 success           
                                               gateway-mq                                
      root*    1.0        karaf   yes          fabric                  success           
                                               fabric-ensemble-0001-1                    
        b1     1.0        karaf   yes          default                 success           
                                               mq-broker-default.test                    
      ssh1     1.0        karaf   yes          default                 success           
                                               fabric-ensemble-0001-2                    
        b2     1.0        karaf   yes          default                 success           
                                               mq-broker-default.test                    
      ssh2     1.0        karaf   no           default                 success           
                                               fabric-ensemble-0001-3         
      

      The default jboss-fuse-full profile must be removed from "root", else it will provide a broker of its own, which will interfere with the test.

      Using any JMS client, ensure that a connection can be made to the gateway on port 61616.
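
      For reference, here is a minimal sketch of such a check using the plain ActiveMQ JMS client (any JMS client will do); gateway.example.com is a placeholder for whatever host the gateway container is running on:

      import javax.jms.Connection;
      import javax.jms.JMSException;
      import org.apache.activemq.ActiveMQConnectionFactory;

      public class GatewayCheck {

          // Returns true if a JMS connection to the given broker URL can be established.
          static boolean canConnect(String brokerUrl) {
              ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(brokerUrl);
              Connection connection = null;
              try {
                  connection = factory.createConnection();
                  connection.start();   // make sure the transport is actually connected
                  return true;
              } catch (JMSException e) {
                  System.out.println("Connection failed: " + e.getMessage());
                  return false;
              } finally {
                  if (connection != null) {
                      try { connection.close(); } catch (JMSException ignored) { }
                  }
              }
          }

          public static void main(String[] args) {
              // port 61616 is the gateway's MQ port from the reproduction steps
              System.out.println("Connected: " + canConnect("tcp://gateway.example.com:61616"));
          }
      }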

      Ensure (e.g., by looking in the logs) that b2 is the master broker. If it is not, shut down b1, then restart it, so it comes up as slave.

      Now disconnect ssh2 and b2 from the rest of the topology, either physically or using iptables. Note that a clean shutdown is not sufficient to reproduce the problem (I think) – we need network timeouts.

      This will put the fabric into a sub-quorum state. In this state we should expect odd things to happen, and they do. Leave it like this for several minutes – until ZooKeeper "Client has been stopped" messages are seen.

      Verify that a connection can no longer be made to the gateway. Although we have a running broker that is reachable from the gateway, it does not seem to take over the master role – presumably because the cluster is sub-quorum. In itself, I don't think this is a bug.
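
      If you are using something like the canConnect() sketch above, it should now return false; the exact exception seen on the client side will depend on how the gateway rejects or drops the connection attempt.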

      Now reconnect ssh2 and b2. It will take a little while for the cluster to re-form, and for fabric:xxx commands to start working again.

      Note that it is still not possible to make a connection to the gateway. We see "No endpoints available" messages in the logs. Although brokers are available, the gateway will not route messages to either of them.
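
      To watch for recovery (or the lack of it) over time, a simple polling loop around the hypothetical canConnect() helper from the sketch above can be used; again this is only a sketch, with gateway.example.com standing in for the real gateway host:

      public class GatewayWatch {

          public static void main(String[] args) throws InterruptedException {
              String url = "tcp://gateway.example.com:61616";   // placeholder gateway host
              while (true) {
                  boolean up = GatewayCheck.canConnect(url);
                  System.out.println(System.currentTimeMillis() + " gateway reachable: " + up);
                  Thread.sleep(10000);   // poll every 10 seconds
              }
          }
      }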

      Restarting the gateway recovers service.


      This problem seems to be related to ENTESB-6254; however, that bug was verified as fixed in 6.2.1 R7 (I checked it myself), whilst this current problem is reproducible in R7. So either ENTESB-6254 was not fully fixed, or we have discovered a new way to elicit a very similar-looking failure.

      The problem I can reproduce is one in which the Fabric8 MQ gateway does not realize that there are brokers available after a network outage, even though one of the brokers is master and the other slave, and both are reachable on the network. It seems that the outage has to be sufficient to bring about a sub-quorum state in the ZK ensemble, so that neither broker is master for some time. However, it's possible that other ZK events may have a similar effect.

      In fact, I rather suspect that there are similar, but distinct, modes of failure, depending on exactly where in the topology the outage occurs, and what ZK roles each node is playing at the time. I have logs from the customer showing a situation where both brokers get stuck as slaves after the outage is resolved – but I have not so far been able to reproduce that.

      In all cases the practical impact is that after an outage, the gateway/broker system does not recover, and manual action always seems to be necessary to restore service.

            pantinor@redhat.com Paolo Antinori
            rhn-support-kboone Kevin Boone
