  1. JGroups
  2. JGRP-395

Parallel FD


    • Type: Feature Request
    • Resolution: Done
    • Priority: Major
    • 2.5
    • 2.4
    • None
    • Medium

      With FD, when a cluster has N nodes and the switch crashes, every node takes roughly (N-1) * TIMEOUT ms to become a singleton cluster. This is because regular FD only pings the next-in-line member, so each node detects the unreachable members one after another, e.g.:

      • Cluster is A, B, C, D
      • The plug is pulled
      • Take B as an example:
      • After TIMEOUT ms, B decides that C is dead and excludes C from the pingable members
      • B then keeps emitting SUSPECT(C) until it gets a new view which excludes C
      • B switches to pinging D
      • After another TIMEOUT ms, B excludes D and switches to pinging A
      • When all of C, D and A have been excluded, B decides to become a singleton cluster (and the coordinator of it)
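
The walk described above can be sketched in a few lines of Java. This is only an illustration of the arithmetic, not JGroups code: the class and method names are hypothetical, the 3000 ms timeout is an assumed value, and the sketch ignores the time spent installing new views.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of JGroups: models the order in which a node
// using next-in-line pinging excludes its peers after a total partition.
public class SequentialFdSketch {

    /** Members that 'self' suspects, in order, when every other member is unreachable. */
    public static List<String> suspectOrder(List<String> view, String self) {
        int start = view.indexOf(self);
        if (start < 0) {
            throw new IllegalArgumentException("self is not in the view");
        }
        List<String> order = new ArrayList<>();
        // Walk the ring starting at the right-hand neighbor of 'self'.
        for (int i = 1; i < view.size(); i++) {
            order.add(view.get((start + i) % view.size()));
        }
        return order;
    }

    /** Rough time until a node is a singleton: one TIMEOUT per excluded peer. */
    public static long timeToSingletonMs(int clusterSize, long timeoutMs) {
        return (clusterSize - 1) * timeoutMs;
    }

    public static void main(String[] args) {
        List<String> view = Arrays.asList("A", "B", "C", "D");
        long timeoutMs = 3000; // assumed value for illustration
        System.out.println(suspectOrder(view, "B"));               // prints [C, D, A]
        System.out.println(timeToSingletonMs(view.size(), timeoutMs)); // prints 9000
    }
}
```

With 4 nodes the delay is three timeouts; it grows linearly with the cluster size, which is the problem this issue addresses.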

      SOLUTION:

      • Nodes don't actively ping other nodes. Instead, each node periodically multicasts a HEARTBEAT to the cluster
      • The HEARTBEAT is suppressed when a node sends data, because data counts as a heartbeat as well
      • Every node maintains a table of nodes and the last time we received either a message or a HEARTBEAT from that node
      • The timestamp is updated with the current time whenever a message or a HEARTBEAT is received from that node
      • Periodically, we check whether any node has not sent us data or a heartbeat for more than TIMEOUT ms. If so, we suspect it
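
A minimal sketch of the bookkeeping proposed above, under stated assumptions: the class `HeartbeatTable` and all of its methods are hypothetical names chosen for illustration and are not the JGroups API. Node addresses are modeled as plain strings, and the current time is passed in explicitly so the logic is easy to follow and test.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not part of JGroups: per-node table of "last heard from" times.
public class HeartbeatTable {
    private final long timeoutMs;
    // node -> time (ms) at which we last received data or a HEARTBEAT from it
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    // time (ms) at which this node last sent anything to the cluster
    private volatile long lastSentMs;

    public HeartbeatTable(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Call for every received data message or HEARTBEAT; both count as a sign of life. */
    public void received(String sender, long nowMs) {
        lastSeen.put(sender, nowMs);
    }

    /** Call whenever this node sends data or a HEARTBEAT. */
    public void sent(long nowMs) {
        lastSentMs = nowMs;
    }

    /** A HEARTBEAT is only needed if nothing was sent during the last interval. */
    public boolean heartbeatDue(long nowMs, long intervalMs) {
        return nowMs - lastSentMs >= intervalMs;
    }

    /** Periodic check: nodes not heard from for more than the timeout, sorted by name. */
    public List<String> suspects(long nowMs) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastSeen.entrySet()) {
            if (nowMs - e.getValue() > timeoutMs) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }
}
```

Because every node checks all entries in the same periodic pass, all unreachable members are suspected after roughly one timeout instead of one timeout per member.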

              rhn-engineering-bban Bela Ban
