Application Server 3 4 5 and 6 / JBAS-9456

JBoss 6.0.0 fails to restart HA Singletons after recovering from a split brain


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • None
    • 6.0.0.Final
    • Clustering
    • None
    • Steps to Reproduce:

      Extract JBoss 6.0.0 on 2 boxes on a network.
      Deploy a singleton in the server/all/deploy-hasingleton/ directories.
      Start JBoss running with the 'all' profile on both boxes.
      Wait for them to cluster.
      Check that the singleton is only running on one of the boxes.
      Kill the network (pull the cable or "iptables -A INPUT -s <ipaddress> -j DROP").
      Wait until the singleton has started up on the 2nd box.
      Enable the network (plug in the cable or "iptables -F").
      Watch as both singletons remain running.


      We've been running JBoss 6.0.0 clustered across 2 boxes with a number of HA Singletons. A brief network outage caused the cluster to split and the HA Singletons to start up on the second box. After the network issues were resolved, the JBoss instances correctly re-clustered, but the HA Singletons remained running on both boxes.
      I believe that they should have stopped automatically, and that only the HA Singletons on the master node should have started back up.

      I've finally tracked the issue down to common/lib/jboss-ha-server-core.jar, using the source code at
      http://grepcode.com/snapshot/repository.jboss.org/nexus/content/repositories/releases/org.jboss.cluster/jboss-ha-server-core/1.0.0.Final

      The bug is in the file:
      org/jboss/ha/core/framework/server/DistributedReplicantManagerImpl.java

      In the method:
      /**
       * Add a replicant to the replicants map.
       * @param key replicant key name
       * @param nodeName name of the node that adds this replicant
       * @param replicant Serialized representation of the replica
       * @return true, if this replicant was newly added to the map, false otherwise
       */
      protected boolean addReplicant(String key, String nodeName, Serializable replicant)
      {
         ConcurrentMap<String, Serializable> map = new ConcurrentHashMap<String, Serializable>();
         ConcurrentMap<String, Serializable> existingMap = this.replicants.putIfAbsent(key, map);
         return (((existingMap != null) ? existingMap : map).put(nodeName, replicant) != null);
      }

      The last line of the method should be changed to:
      return (((existingMap != null) ? existingMap : map).put(nodeName, replicant) == null);

      addReplicant() should return true if the replicant wasn't previously in the map, which would happen if the Map.put() method returns null. It looks like the return value of this method is only checked when merging a split cluster.
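
      The difference can be reproduced outside JBoss with a small standalone sketch. This is not the real DistributedReplicantManagerImpl: the class name ReplicantMapSketch, the declared type of the replicants field, and the method names addReplicantOriginal and addReplicantProposed are made up for the illustration. Only the two return expressions are taken from the code above.

```java
import java.io.Serializable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for the replicant bookkeeping; not the JBoss class.
class ReplicantMapSketch {

    // Assumed shape of the map: key -> (node name -> replicant).
    private final ConcurrentMap<String, ConcurrentMap<String, Serializable>> replicants =
            new ConcurrentHashMap<String, ConcurrentMap<String, Serializable>>();

    // Return expression as shipped in 1.0.0.Final: true only if an entry already existed.
    boolean addReplicantOriginal(String key, String nodeName, Serializable replicant) {
        ConcurrentMap<String, Serializable> map = new ConcurrentHashMap<String, Serializable>();
        ConcurrentMap<String, Serializable> existingMap = this.replicants.putIfAbsent(key, map);
        return ((existingMap != null) ? existingMap : map).put(nodeName, replicant) != null;
    }

    // Proposed return expression: true only if the entry is new (put() returned null).
    boolean addReplicantProposed(String key, String nodeName, Serializable replicant) {
        ConcurrentMap<String, Serializable> map = new ConcurrentHashMap<String, Serializable>();
        ConcurrentMap<String, Serializable> existingMap = this.replicants.putIfAbsent(key, map);
        return ((existingMap != null) ? existingMap : map).put(nodeName, replicant) == null;
    }

    public static void main(String[] args) {
        ReplicantMapSketch original = new ReplicantMapSketch();
        // First add of a new replicant, then a repeat add for the same node.
        System.out.println(original.addReplicantOriginal("svc", "nodeA", "r1")); // false
        System.out.println(original.addReplicantOriginal("svc", "nodeA", "r2")); // true

        ReplicantMapSketch proposed = new ReplicantMapSketch();
        System.out.println(proposed.addReplicantProposed("svc", "nodeA", "r1")); // true
        System.out.println(proposed.addReplicantProposed("svc", "nodeA", "r2")); // false
    }
}
```

      With the original expression, a newly added replicant is reported as "not added", which is the opposite of what the Javadoc promises. The proposed expression matches the documented contract.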

      Probably affects JBoss 6.1.0 - not sure about 7.X.X though.

              Assignee: Paul Ferraro (pferraro@redhat.com)
              Reporter: Robert Hayward (maxrabbit, Inactive)