  1. JGroups
  2. JGRP-1902

Simplify failure detection and merge timeout configuration


    • Type: Enhancement
    • Resolution: Done
    • Priority: Minor
    • 3.6.2
    • 3.6
    • None

      The FD/FD_ALL/FD_ALL2/FD_SOCK javadoc gives no guidance on how long it takes to detect a leaving member. The MERGE2/MERGE3 javadoc likewise doesn't say how long it takes to detect that the network has healed.

      As an example of how misleading the current settings can be, I have seen MERGE3 take more than 20s to merge two partitions with min_interval=1000 and max_interval=5000. Similarly, FD detects a leaver after timeout * max_tries in the best case, and after twice that if 2 consecutive nodes (in the members list) leave at the same time.
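
      To make the FD arithmetic above concrete, here is a minimal sketch. The class and method names are hypothetical (not part of JGroups), and the values timeout=3000 and max_tries=3 are only illustrative assumptions; the two formulas are the ones stated in this issue.

```java
// Hypothetical helper, NOT part of JGroups: it only encodes the FD
// arithmetic described in this issue.
final class FdDetectionEstimate {

    // Best case per the issue text: timeout * max_tries.
    static long bestCaseMillis(long timeoutMillis, int maxTries) {
        return timeoutMillis * maxTries;
    }

    // Two consecutive members (in the members list) leave at the same time:
    // the issue reports twice the best-case time.
    static long twoConsecutiveLeaversMillis(long timeoutMillis, int maxTries) {
        return 2 * bestCaseMillis(timeoutMillis, maxTries);
    }

    public static void main(String[] args) {
        long timeout = 3000; // illustrative value (assumption)
        int maxTries = 3;    // illustrative value (assumption)
        System.out.println("best case: "
                + bestCaseMillis(timeout, maxTries) + " ms");
        System.out.println("two consecutive leavers: "
                + twoConsecutiveLeaversMillis(timeout, maxTries) + " ms");
    }
}
```

      With these assumed values, an RPC timeout below the second figure could expire before FD reports the leavers.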

      The maximum time it takes to detect a leaver is of particular interest to Infinispan users, because Infinispan is supposed to protect against nodes leaving. But if the users don't configure a high enough RPC timeout in Infinispan, we don't get to detect the node leaving.

      Ideally, the user should be able to specify a maximum detection time, and the protocol should adjust the existing settings to meet that (most of the time).
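
      One way the proposal could work for FD, as a rough sketch only: start from the user's maximum detection time and derive timeout from it, budgeting for the "2 consecutive leavers" case described above. The class and method names are hypothetical and this is not an existing JGroups API; it also assumes max_tries is kept fixed and only timeout is adjusted.

```java
// Hypothetical sketch, NOT an existing JGroups API: derive an FD timeout
// from a user-specified maximum detection time.
final class FdSettingsFromBudget {

    // Assumes the worst case is 2 * timeout * max_tries (two consecutive
    // leavers, per the issue text) and solves for timeout.
    static long timeoutFor(long maxDetectionMillis, int maxTries) {
        if (maxDetectionMillis <= 0 || maxTries <= 0) {
            throw new IllegalArgumentException(
                    "maxDetectionMillis and maxTries must be positive");
        }
        // Integer division rounds down, so the result never exceeds the budget.
        long timeout = maxDetectionMillis / (2L * maxTries);
        if (timeout == 0) {
            throw new IllegalArgumentException(
                    "detection budget too small for " + maxTries + " tries");
        }
        return timeout;
    }

    public static void main(String[] args) {
        // Illustrative budget (assumption): detect a leaver within 18 s.
        System.out.println("timeout = " + timeoutFor(18_000, 3) + " ms");
    }
}
```

      Rounding down keeps the derived settings inside the requested budget, which matches the "most of the time" caveat only loosely: it does not account for scheduling or network delays.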

            rhn-engineering-bban Bela Ban
            dberinde@redhat.com Dan Berindei (Inactive)