JGroups / JGRP-2380

Sometimes cluster members are not discovered when using TCPGOSSIP


    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Minor
    • 4.1.6
    • 4.0.19
    • None

      Sometimes a new member cannot join an existing cluster when TCPGOSSIP is used with the use_nio property set to true. In that case the new member forms its own cluster containing only itself. After some time the MERGE3 protocol merges the two clusters into one, but if the min_interval/max_interval values are large, this may take a while.

      For some reason, the first attempt at initial discovery always ends by hitting join_timeout. In this case only a few members are discovered, and none of them is a coordinator.
      If we are lucky, GMS logs "I (WO-KIT-967-28892) am not the first of the nodes, waiting for another client to become coordinator" and makes a second attempt to join the cluster, which takes only a few milliseconds and succeeds (see logs_success.txt). On failure, GMS logs "I (WO-KIT-967-14786) am the first of the nodes, will become coordinator" and creates a new cluster with only one member (see logs_failure.txt).

      Expected behavior: the first attempt at initial discovery should not fail due to the timeout, and it should be as fast as the second one.

      Workaround: set use_nio to false (or just remove it from the stack configuration)
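
      For illustration, the workaround could look like the fragment below. This is only a sketch and not the attached jgroups.xml: the GossipRouter host and port, the bind port, and all timeout and interval values are assumptions chosen for the example, and the other protocols of a complete stack are omitted.

```xml
<!-- Partial stack sketch; values are illustrative assumptions, not taken from the attached jgroups.xml -->
<config xmlns="urn:org:jgroups">
    <TCP bind_port="7800"/>

    <!-- Workaround: use blocking connections to the GossipRouter.
         gossip.example.com[12001] is a hypothetical router address. -->
    <TCPGOSSIP initial_hosts="gossip.example.com[12001]"
               use_nio="false"/>

    <!-- Shorter intervals reduce how long a split-off singleton cluster
         waits before MERGE3 merges it back. -->
    <MERGE3 min_interval="10000"
            max_interval="30000"/>

    <!-- join_timeout is the limit the first discovery attempt runs into. -->
    <pbcast.GMS join_timeout="3000"/>
</config>
```

      Lowering the MERGE3 intervals does not address the failed first discovery; it only shortens the time the new member spends in its own single-member cluster.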

        1. jgroups.xml
          1.0 kB
          Pavlo Fedyna
        2. logs_failure.txt
          26 kB
          Pavlo Fedyna
        3. logs_success.txt
          3.46 MB
          Pavlo Fedyna

              rhn-engineering-bban Bela Ban
              pavlo_fedyna Pavlo Fedyna (Inactive)