  1. AMQ Broker
  2. ENTMQBR-8209

AMQ 7: There is no perfect replacement for active/passive mode on OpenShift


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Undefined
    • None
    • AMQ 7.11.0.GA
    • broker-core
    • None
    • False
    • None
    • False
    • Important

      Outside the OpenShift arena, many AMQ users set up active/passive broker pairs to provide fault tolerance. Typically, storage locking is used to control which broker is live and which is backup at a particular time. To be 'active', a broker must be able to take control of the storage. This not only ensures that exactly one broker can be in the active role at a time, but also protects the storage from 'split brain' scenarios.
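
      The locking rule described above can be sketched in a few lines. This is only an illustration of the general technique, not AMQ's actual implementation: the `Broker` class, the `journal.lock` file name, and the `demo` function are hypothetical, and the sketch assumes a Unix-like system where `fcntl.flock` is available and behaves as it does on a local file system.

```python
import fcntl
import os
import tempfile


class Broker:
    """Toy broker (hypothetical) that is 'active' only while it holds the storage lock."""

    def __init__(self, name, lock_path):
        self.name = name
        self.lock_path = lock_path
        self._fd = None  # file descriptor held open while we own the lock

    @property
    def active(self):
        return self._fd is not None

    def try_activate(self):
        """Try to take the exclusive lock without blocking; return True if now active."""
        if self.active:
            return True
        fd = os.open(self.lock_path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)  # another broker owns the storage; stay passive
            return False
        self._fd = fd
        return True

    def stop(self):
        """Give up the active role; closing the descriptor releases the lock."""
        if self._fd is not None:
            os.close(self._fd)
            self._fd = None


def demo():
    """Two brokers compete for one lock file; only one can be active at a time."""
    with tempfile.TemporaryDirectory() as storage:
        lock_path = os.path.join(storage, "journal.lock")  # assumed file name
        a, b = Broker("a", lock_path), Broker("b", lock_path)
        results = [a.try_activate(), b.try_activate()]  # a wins, b stays passive
        a.stop()                                        # a fails or shuts down
        results.append(b.try_activate())                # b can now take over
        b.stop()
        return results
```

      In this sketch a passive broker would simply call `try_activate()` in a loop until it succeeds. Whether file locks are reliable on a given shared or network file system is a separate question that the sketch does not address.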

      AMQ 6 supported an active/passive mode of operation on OpenShift, and Red Hat provided templates to set it up. Since AMQ 7, however, Red Hat has discouraged this mode of operation because it conflicts with the pod liveness guarantee provided by Kubernetes. There is no need for an active/passive mode, we argued, because the alternative – a deployment with one replica – works just as well, and is simpler. Kubernetes will act to keep the single replica alive, instantiating it on a different node if necessary to maintain that one pod. Consequently, the current implementation of the AMQ 7 operator will not even allow an active/passive mode to be set up.

      The operator can set up a broker mesh, which will provide some of the HA guarantees of an active/passive pair, plus the potential for load sharing. But a mesh has problems with message ordering, durable subscribers, message groups, and some other JMS features. Often a single active/passive pair behaves more predictably.

      Unfortunately, there are scenarios in which a single-replica deployment is not a perfect substitute for active/passive operation. The basic AMQ 7 deployment is a stateful set, and Kubernetes will actively prevent multiple instances of a pod with the same identity in such a set. Even with a single-replica deployment, the pod still has a fixed identity.
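
      The effect of a fixed pod identity can be shown with a toy model. This is a deliberately simplified sketch, not the Kubernetes controller: the `ToyStatefulSet` class, the `broker-ss` name, and the status strings are hypothetical. It assumes only that a stateful set names its pods `<name>-<ordinal>` and does not create a replacement while a pod object with that name still exists.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStatefulSet:
    """Toy model (hypothetical): at most one pod object per fixed identity."""

    name: str
    replicas: int = 1
    pods: dict = field(default_factory=dict)  # pod name -> last reported status

    def expected_names(self):
        # Stateful set pods get stable names of the form <name>-<ordinal>.
        return [f"{self.name}-{i}" for i in range(self.replicas)]

    def reconcile(self):
        """Create a pod only for identities that have no pod object at all."""
        created = []
        for pod_name in self.expected_names():
            if pod_name not in self.pods:
                self.pods[pod_name] = "Running"
                created.append(pod_name)
        return created


def demo():
    sts = ToyStatefulSet("broker-ss")      # assumed stateful set name
    events = [sts.reconcile()]             # initial creation
    del sts.pods["broker-ss-0"]            # pod confirmed gone (orderly case)
    events.append(sts.reconcile())         # replacement is created at once
    sts.pods["broker-ss-0"] = "Unknown"    # node unreachable, pod object remains
    events.append(sts.reconcile())         # nothing is created
    del sts.pods["broker-ss-0"]            # pod object finally removed
    events.append(sts.reconcile())         # only now is the broker replaced
    return events
```

      In the third step the model creates nothing, because it cannot tell whether the old pod is still running somewhere, which is the situation in which a single-replica broker stays unavailable.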

      This means that there are failure modes in OpenShift in which a pod can be lost, and a replacement will not be instantiated for some time. These problems do not arise with simple failure modes such as a pod crashing. Nor do they arise with an orderly shutdown of a node. Problems arise when a node fails in a disorderly way, particularly when it is separated from the rest of the cluster.

      In most cases, to avoid duplicating a pod with the same identity, Kubernetes will allow an extended period of time before assuming that the node is dead – typically many minutes. For that whole period, the entire broker cluster – because it consists of a single replica – is unavailable.

      Although these kinds of failure modes are rare, they are possible, and the broker should be able to deal with them. It is unfortunate that we cannot offer the same responsiveness to a broker failure on OpenShift that we usually can on bare metal.

       

            Assignee: Unassigned
            Reporter: Kevin Boone (rhn-support-kboone)
            Votes: 10
            Watchers: 5
