-
Feature Request
-
Resolution: Done
-
Major
-
None
-
None
[(Edited) email reply to David Forget]
FD_ALL provides a mechanism to detect rapidly when several nodes leave the groups at the same time. Once a failure is detected by a member of the cluster, a SUSPECT message is multicast to all nodes. Then, another logic takes place in all other nodes call VERIFY_SUSPECT. This is to give one last chance to the suspected node before it is removed from the view. The default configuration (udp.xml) makes it so that each node will then send 1 UDP messages to the suspected node which, if still alive, would reply with 1 UDP messages. Since we are running many nodes on the same host and each of these nodes are heavily multithreaded and running in a JVM with garbage collection, it happens that a false failure is detected. The consequences of this scenario given the current configuration result in an excessive number of messages sent on UDP. For example, in a 300 nodes cluster a single false failure case results in roughly 2 verify suspect messages (ARE_YOU_ALIVE+I_AM_NOT_DEAD) x 298 nodes that process the suspect message x 299 nodes that detect the failure = ~180 000 UDP messages
{assuming that all other nodes detect the false failure in a very short period of time (This is the worst case)}. This network flood can lead to more false detection and more UDP messages up to a network saturation and eventually a complete cluster / network degradation already know as "UDP Storm"...
So 300 nodes multicast the SUSPECT message. That's 300 (multicast) messages TOTAL.
Then every one of those (300*num_msgs) unicasts an ARE_YOU_DEAD message to the suspected member S. Since num_msgs=1, that's another 300 unicast messages, so our total is now 600.
The suspected member S now gets 300 unicasts and - as a response - send 300 unicast I_AM_NOT_DEAD responses.
So now the total is 900 messages (300 multicasts + (300 unicasts * num_msgs) + 300 unicasts. I can't follow your computation above with 180'000 messages !?
SOLUTION #1:
However, one thing we could is the following: similar to STABLE and sending of stability messages, instead of multicasting out a SUSPECT message immediately, we could schedule a task which goes off in a random period in range [1 .. max_mcast_wait_time]. Reception of a SUSPECT message for the same suspected member cancels this timer.
The effect of this staggered way of multicasting SUSPECT messages would be that ideally only 1 member multicasts a SUSPECT message. That's already a huge improvement and results in <1 SUSPECT multicast> + <2 * 300 unicast messages (in VERIFY_SUSPECT)>...
SOLUTION #2
In a second step what we could also do is to change VERIFY_SUSPECT, and have only the first N nodes in a cluster run the ARE_YOU_DEAD algorithm. This is similar to what I recently implemented in [1] to make discovery scale to large clusters. WDYT ?
- is related to
-
JGRP-100 Large-scale JGroups
- Closed