JBoss Enterprise Application Platform / JBEAP-6007

Customer feedback: Document split brain issue in replicated HA topology

      Link: https://access.redhat.com/documentation/en/red-hat-jboss-enterprise-application-platform/7.0/single/configuring-messaging/#data_replication

      The documentation describes the behavior of the Backup node when it loses the connection to its Live:

      Much like in the shared-store case, when the live server stops or crashes, its replicating backup will become active and take over its duties. Specifically, the backup will become active when it loses connection to its live server. This can be problematic because this can also happen because of a temporary network problem. In order to address this issue, the backup will try to determine whether it still can connect to the other servers in the cluster. If it can connect to more than half the servers, it will become active. If it loses communication to its live server plus more than half the other servers in the cluster, the backup will wait and try reconnecting with the live server. This avoids a split brain situation.
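
The decision rule quoted above can be sketched as follows. This is only an illustration of the documented behavior, not the broker's actual implementation: the function name and parameters are hypothetical, and treating "exactly half reachable" as "wait" is an assumption, since the documentation only says "more than half".

```python
def backup_should_activate(live_reachable: bool,
                           reachable_others: int,
                           total_others: int) -> bool:
    """Hypothetical sketch of the backup's decision as the docs describe it.

    live_reachable   -- whether the backup can still reach its live server
    reachable_others -- how many of the other cluster servers it can reach
    total_others     -- how many other servers the cluster has
    """
    if live_reachable:
        # The live server is still visible, so the backup keeps replicating.
        return False
    # Become active only with a strict majority of the other servers.
    # Assumption: exactly half is not enough, so the backup waits and retries.
    return 2 * reachable_others > total_others
```

Note that this rule is evaluated only on the backup. Nothing in it describes a corresponding check on the live server.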

      The paragraph gives the impression that a split brain situation cannot occur if the Live-Backup pair is part of a cluster with other nodes. However, the mechanism solves the problem only from the Backup's point of view. There is no description of what happens to the Live if it is disconnected. It should be noted that the Live stays active.

      The image [1] shows the situation in which the Live is disconnected from the network. If the connection between the Live and the router is broken, the Backup loses the connection to its Live but can still connect to more than half of the servers in the cluster, so it becomes active. Both Live and Backup are now active. At this point, two undesired situations can occur:

      1. Remote clients fail over to the Backup while the Live still serves local clients (e.g. an MDB). The two nodes end up with completely different journals -> split brain.
      2. Remote clients fail over to the Backup and someone then fixes the broken connection. Old clients keep communicating with the Backup, but new clients connect to the Live -> split brain.
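
The scenario above can be reproduced with a small toy model. This is a sketch under stated assumptions: the `Node` class, the helper names, and the three additional cluster nodes are hypothetical, and the model encodes only the majority rule quoted from the documentation.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    active: bool
    connected: bool = True  # state of the link between this node and the router


def can_reach(a: Node, b: Node) -> bool:
    # Two nodes can talk only if both still have a working link to the router.
    return a.connected and b.connected


def backup_takes_over(backup: Node, live: Node, others: list) -> bool:
    # Backup-side rule from the documentation: activate when the live server
    # is unreachable and more than half of the other servers are reachable.
    if can_reach(backup, live):
        return False
    reachable = sum(1 for n in others if can_reach(backup, n))
    return 2 * reachable > len(others)


def simulate_live_disconnect():
    live = Node("live", active=True)
    backup = Node("backup", active=False)
    others = [Node(f"node{i}", active=True) for i in range(3)]  # hypothetical cluster size

    live.connected = False  # the link between the Live and the router breaks
    backup.active = backup_takes_over(backup, live, others)
    # No rule in the described mechanism deactivates the Live,
    # so it keeps serving its local clients.
    return live.active, backup.active


live_active, backup_active = simulate_live_disconnect()
```

In this model both flags end up true, which is the precondition for both split brain cases listed above.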

      The documentation should mention that network failures always carry a risk of split brain. Perhaps there could be a chapter named "Limitations of data replication".

      [1]

            rhn-support-pfestoso_jira Phil Festoso (Inactive)
            eduda_jira Erich Duda (Inactive)