AMQ Broker / ENTMQBR-118

ZooKeeper fail-over causes broker downtime


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major

      We have a setup with 3 machines, each running 1 root container and 1 child container.
      Each child container runs a replicated LevelDB broker, and all three root containers are members of the ZooKeeper ensemble.

      Scenario:
      1) All containers are running.
      2) Shut down child container 3. The broker still works as expected and fails over if needed.
      3) Shut down root container 3. The master broker (child container 1) now gets the message "Demoted to slave", while the second broker gets the message "Not enough cluster members connected to elect a master."

      This causes service downtime, which is not expected when only a slave broker and its root container are shut down. At this point a majority should still remain for both the ZooKeeper ensemble (2 of 3 root containers) and the brokers (2 of 3 child containers).
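
      The majority expectation can be checked with a little arithmetic. The sketch below is illustrative only: it assumes a simple strict-majority rule (more than half of the configured members) for both the ZooKeeper ensemble and the replicated brokers, and it assumes the broker group is configured for 3 replicas, which the report does not state explicitly. The function names are hypothetical.

```python
def majority(total: int) -> int:
    """Smallest member count that is a strict majority of `total`."""
    return total // 2 + 1


def has_quorum(total: int, alive: int) -> bool:
    """True when the surviving members still form a strict majority."""
    return alive >= majority(total)


ENSEMBLE_SIZE = 3    # root containers in the ZooKeeper ensemble (from the report)
BROKER_REPLICAS = 3  # assumption: the replicated broker group expects 3 replicas

# (step, surviving root containers, surviving child containers)
scenario = [
    ("1) all containers running", 3, 3),
    ("2) child container 3 stopped", 3, 2),
    ("3) root container 3 stopped", 2, 2),
]

for step, roots_alive, children_alive in scenario:
    zk_ok = has_quorum(ENSEMBLE_SIZE, roots_alive)
    mq_ok = has_quorum(BROKER_REPLICAS, children_alive)
    print(f"{step}: ensemble quorum={zk_ok}, broker quorum={mq_ok}")
```

      Under these assumptions both checks stay true at every step, which is why the demotion in step 3 is reported as unexpected.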

      Here are the exact steps I performed:
      1. Prepare three machines and install JBoss Fuse 6.1 on each of them.
      2. On machine A, start the JBoss Fuse 6.1 container and create a fabric from the console with "fabric:create --wait-for-provisioning". Then supply the username and password at the prompt.
      3. On machine B, start the JBoss Fuse 6.1 container and join it to the fabric created on machine A from the console with "fabric:join --zookeeper-password <password> <machineA_hostname> <newContainerName>". The container stops because its default name is changed, so you need to start it again.
      4. On machine C, start the JBoss Fuse 6.1 container and join it to the fabric on machine A from the console with "fabric:join --zookeeper-password <password> <machineA_hostname> <newContainerName>". As before, you need to start it again.
      5. From the machine A container console, add the machine B and machine C containers to the ensemble with "ensemble-add <machineB_containerName> <machineC_containerName>".
      6. Start hawtio.
      7. Go to "Runtime" -> "MQ" and click "+Broker" to create a new broker profile. On the "Default" tab, choose "Replicated" for the property "Kind". Fill in a new "Group" name, say "replicated". Then fill in the "Broker name" field with a name of your choice, say "test". Leave everything unchanged on the "Advance" tab. Then click "+Create Broker". This creates a new broker profile called "mq-broker-replicated.test".
      8. Create a child container with the profile "mq-broker-replicated.test" for each of the three ensemble containers. To do this, click "+Create" under "Runtime" -> "Containers", choose the appropriate "parent container", and fill in a "container name" for the child container. Remember to select the profile "mq-broker-replicated.test" for the new child container before clicking the "Create And Start Container" button.
      9. The replicated broker on machine C should be a slave broker. Stop this broker from hawtio; the remaining two brokers should keep running without any problem.
      10. Go to machine C and manually stop its root container. Because that root container was not created through Fabric, you cannot stop it from hawtio. You should then see the error described above.

      Although the command "cluster-list fusemq" still reports a correct status, for example

      JBossFuse:karaf@root> cluster-list fusemq
      [cluster]                      [masters]                      [slaves]         [services]
      replica
         test                        brokerA                       brokerB        tcp://machinA:61521
      

      in fact neither of the brokers was running.
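
      One way to confirm independently of "cluster-list" whether the advertised service is really up is to attempt a plain TCP connection to the listed address. The following is a minimal sketch, not part of the original report: the helper name is hypothetical, and a successful connection only shows that something is listening on the port, not that the broker is serving messages.

```python
import socket


def accepts_connections(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it checks only that a listener exists,
    not that the listener is a healthy broker.
    """
    try:
        # create_connection resolves the host and connects; the socket is
        # closed again as soon as the with-block exits.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False
```

      For the output above, calling accepts_connections("machinA", 61521) would be expected to return False if the master broker listed under [services] is not actually running.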

              Assignee: Unassigned
              Reporter: Joe Luo (rhn-support-qluo)