Type: Feature Request
Resolution: Done
Priority: Major
Fix Version/s: 2.5
Affects Version/s: 2.4
Labels: None
Estimated Difficulty: Medium


With FD, when a cluster has N nodes and the switch crashes, every node takes roughly (N-1) * TIMEOUT ms to become a singleton cluster. This is because regular FD only pings the next-in-line member, so a node suspects the other members one at a time, e.g.

Cluster is A, B, C, D
The plug is pulled
Example for B:
After TIMEOUT ms, B decides that C is dead and excludes C from the pingable members
B then keeps emitting SUSPECT(C) until it gets a new view which excludes C
B switches to pinging D
After another TIMEOUT ms, B excludes D and switches to pinging A
When all of C, D and A have been excluded, B decides to become a singleton cluster (and the coordinator of it)
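
The sequence above can be sketched as a small timeline calculation. This is only an illustration of the (N-1) * TIMEOUT estimate, not JGroups code: the class name, the method name and the 3000 ms timeout are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of JGroups.
class RingFdTimeline {

    /**
     * Returns the time in ms after which 'self' has excluded every other member,
     * assuming it pings only one member at a time and needs timeoutMs to give up on each.
     */
    static long timeToSingleton(List<String> members, String self, long timeoutMs) {
        int start = members.indexOf(self);
        // Ping order is next-in-line, wrapping around: for B in [A, B, C, D] this is C, D, A
        List<String> pingOrder = new ArrayList<>();
        for (int i = 1; i < members.size(); i++) {
            pingOrder.add(members.get((start + i) % members.size()));
        }
        long elapsed = 0;
        for (String target : pingOrder) {
            elapsed += timeoutMs; // one full timeout is spent on each unreachable member
            System.out.println(self + " excludes " + target + " after " + elapsed + " ms");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Assumed timeout of 3000 ms; 4 members give (4-1) * 3000 = 9000 ms
        long total = timeToSingleton(List.of("A", "B", "C", "D"), "B", 3000);
        System.out.println("B is a singleton after " + total + " ms");
    }
}
```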

SOLUTION:

Nodes don't actively ping other nodes. Instead, each node periodically multicasts a HEARTBEAT to the cluster
The HEARTBEAT is suppressed when a node sends data, because data counts as a heartbeat as well
Every node maintains a table of nodes and the last time it received either a message or a HEARTBEAT from each node
The timestamp for a node is updated with the current time whenever a message or HEARTBEAT from that node is received
Periodically, we check whether any node has not sent us data or a heartbeat for more than the timeout (in ms). If so, we suspect it
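
A minimal sketch of the bookkeeping described above, assuming members are identified by plain strings and that the caller passes in the current time. The class and method names (HeartbeatTable, HeartbeatSender, onMessage, findSuspects, tick) are hypothetical and are not the implementation that went into 2.5.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: table of members and the last time we heard from each of them.
class HeartbeatTable {
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();
    private final long timeoutMs;

    HeartbeatTable(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Called for every data message or HEARTBEAT received from 'sender'. */
    void onMessage(String sender, long nowMs) {
        lastSeen.put(sender, nowMs);
    }

    /** Periodic check: returns the members that have been silent for more than timeoutMs. */
    List<String> findSuspects(long nowMs) {
        List<String> suspects = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMs - entry.getValue() > timeoutMs) {
                suspects.add(entry.getKey());
            }
        }
        Collections.sort(suspects); // stable order for logging and testing
        return suspects;
    }
}

// Hypothetical sketch: decides whether a HEARTBEAT is needed in the current interval.
class HeartbeatSender {
    private boolean dataSentSinceLastTick;

    /** Called whenever this node sends a data message. */
    synchronized void onDataSent() {
        dataSentSinceLastTick = true;
    }

    /** Called once per heartbeat interval; returns true if a HEARTBEAT should be multicast. */
    synchronized boolean tick() {
        boolean suppressed = dataSentSinceLastTick; // data already served as a heartbeat
        dataSentSinceLastTick = false;
        return !suppressed;
    }
}
```

Because every member is checked against the same table in one pass, all unreachable members can be suspected in the same check rather than one per TIMEOUT.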

Assignee: Bela Ban

Reporter: Bela Ban

Votes: 0

Watchers: 0

Created: 2006/12/21 8:24 AM

Updated: 2006/12/22 6:56 AM

Resolved: 2006/12/22 6:56 AM
