JGRP-1670: Cluster doesn't heal after first discovery fails


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 3.3.5, 3.4
    • Affects Version/s: 3.3.1
    • None
    • Steps to Reproduce:
      • Start one JGroups instance using TCPPING and let it cluster with itself.
      • Configure an iptables rule preventing communication to this instance on the JGroups port.
      • Configure a second JGroups instance with TCPPING, listing the first as an initial_host.
      • After the second instance has started, you'll see in its logs that the discovery timed out and a new view was created containing just the new instance.
      • Remove the iptables rule on the old instance.
      • See that neither instance ever reports a complete view of the cluster.

      When using TCPPING for discovery, if the very first attempt to discover the rest of the cluster fails (in my case, the connections are timing out due to a suspected EC2 issue), the new node decides that it is alone in the cluster and creates a new view of just itself.

      Later, when performing periodic discovery, the new node successfully connects to the existing cluster and sends a GET_MBRS_REQ but, since it's already in a view (the one where it's alone), it doesn't fill in its local IP address (see the logic for this at https://github.com/belaban/JGroups/blob/master/src/org/jgroups/protocols/Discovery.java#L254). This means that the old nodes in the cluster cannot send a reply or any other cluster messages to the new node. Thus the new node, which never gets a response to its GET_MBRS_REQ, continues to think it's alone in the world, while the old members, which got a record of a new cluster member but no address to communicate with it on, start logging that they're dropping messages to <UUID> because they have no physical address. The cluster never heals.
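
The exchange described above can be modelled in a few lines of plain Java. This is only a sketch of the reported behaviour, not JGroups code: `MemberRequest`, `buildRequest`, `ExistingMember`, the UUID string and the address are all hypothetical names and values chosen for illustration, and the real logic in Discovery.java may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of the exchange described above. None of these types are JGroups classes. */
final class DiscoveryBugSketch {

    /** Hypothetical stand-in for the payload of a GET_MBRS_REQ. */
    static final class MemberRequest {
        final String senderUuid;
        final Optional<String> physicalAddress;

        MemberRequest(String senderUuid, Optional<String> physicalAddress) {
            this.senderUuid = senderUuid;
            this.physicalAddress = physicalAddress;
        }
    }

    /** Models the reported behaviour: the address is attached only while the sender has no view. */
    static MemberRequest buildRequest(String uuid, String physicalAddress, boolean alreadyInView) {
        Optional<String> attached = alreadyInView ? Optional.empty() : Optional.of(physicalAddress);
        return new MemberRequest(uuid, attached);
    }

    /** Hypothetical existing member that maps UUIDs to physical addresses. */
    static final class ExistingMember {
        private final Map<String, String> addressCache = new HashMap<>();

        /** Returns true if a reply can be sent, false if it has to be dropped. */
        boolean handle(MemberRequest request) {
            request.physicalAddress.ifPresent(addr -> addressCache.put(request.senderUuid, addr));
            return addressCache.containsKey(request.senderUuid);
        }
    }

    public static void main(String[] args) {
        ExistingMember oldNode = new ExistingMember();
        // The new node already sits in a view of itself, so no address is attached.
        MemberRequest periodic = buildRequest("uuid-B", "10.0.0.2:7800", true);
        System.out.println("reply possible: " + oldNode.handle(periodic));
    }
}
```

In this model the old member learns the UUID but never an address for it, which corresponds to the "dropping messages to <UUID>" log lines mentioned above.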

      If the physical address were included in all GET_MBRS_REQ messages (e.g. if the if statement were removed from the file I linked above), then, even if initial discovery fails, future discovery would succeed and the cluster would heal itself.
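
To make the effect of that suggestion concrete, here is a small self-contained Java sketch that replays the scenario under both policies. It is an illustration under stated assumptions rather than a patch: `attachAddress`, `healsAfterFailedFirstRound` and the `alwaysAttach` switch are hypothetical, and the sketch reduces an existing member to a simple UUID-to-address map.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy timeline comparing the two policies. This is not JGroups code. */
final class DiscoveryFixSketch {

    /** Hypothetical policy switch: alwaysAttach = true models the change suggested above. */
    static boolean attachAddress(boolean alreadyInView, boolean alwaysAttach) {
        return alwaysAttach || !alreadyInView;
    }

    /**
     * Replays two discovery rounds from a new node against one existing member.
     * Round 1 is blocked (as with the iptables rule); round 2 is delivered.
     * Returns true if the existing member is able to reply in round 2.
     */
    static boolean healsAfterFailedFirstRound(boolean alwaysAttach) {
        Map<String, String> existingMemberCache = new HashMap<>();
        String uuid = "uuid-B";            // illustrative values only
        String address = "10.0.0.2:7800";

        // Round 1: the request never arrives, so the new node forms a view of itself.
        boolean alreadyInView = true;

        // Round 2: the request arrives; the address is cached only if it was attached.
        if (attachAddress(alreadyInView, alwaysAttach)) {
            existingMemberCache.put(uuid, address);
        }
        // A reply can only be sent to a UUID with a known physical address.
        return existingMemberCache.containsKey(uuid);
    }

    public static void main(String[] args) {
        System.out.println("conditional: " + healsAfterFailedFirstRound(false));
        System.out.println("always:      " + healsAfterFailedFirstRound(true));
    }
}
```

Under these assumptions only the always-attach policy lets the existing member answer the second round, which is the precondition for the two views to merge.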

      This is superficially similar to https://issues.jboss.org/browse/JGRP-1203 but, in that case, the cluster heals once A performs a later discovery, whereas in this one it never heals.

              Assignee: rhn-engineering-bban Bela Ban
              Reporter: boss_mc Andy Caldwell (Inactive)
