- Bug
- Resolution: Done
- Major
- 2.4
- None
- High
Mail from David Foregt:
Hi Bela,
We still have an issue with JGroups 2.4 with UNICAST after applying your
recommended settings. We spent more time analyzing the issue and found the
exact scenario that causes the problem:
- We have multiple nodes running on machine A (a1, a2, a3, a4...a15) and one
node running on machine B (b1).
- The b1 node is started first (coordinator), then all the a nodes are started.
- When all nodes were active in the group, we disconnected machine A from the
network.
- After ~10 sec, all the a nodes see b1 as dead, a new view is propagated to
all the a nodes, and the connection table entry for b1 is cleared on each of them.
- b1 starts seeing the a nodes as dead one by one, every ~10 sec (as defined by
FD / VERIFY_SUSPECT). After 30 sec, b1's view is (a4, a5...a15), and at that
point we reconnected the network cable on machine A. (b1's connection table
was cleared only for a1...a3.)
- After A reconnected to the network, a merge was done; all nodes were back
in the group and able to exchange multicast messages.
- Because b1 did not detect a4...a15 as dead, its seqno for those nodes is not
reset to 1 when it sends them a unicast message. When a4 receives the first
unicast message from b1, its connection table entry for b1 has been cleared,
so at line 453 of UNICAST it creates a new AckReceiverWindow with
initial_seqno = 1 and adds the received message (which has a seqno > 1) to it.
All subsequent unicast messages received from b1 are added to this new
AckReceiverWindow, and when remove is called at line 470 of UNICAST it always
returns null, because AckReceiverWindow::next_to_remove is equal to 1 while
the messages being added to the AckReceiverWindow have a seqno > 1.
The result is that a4...a15 will never be able to receive any further unicast
messages from b1. This is reproducible every time.
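The stall described above can be reproduced in isolation with a small model. The following is a minimal sketch, not the JGroups code: SimpleReceiverWindow is a hypothetical stand-in for AckReceiverWindow, assuming only the behavior reported in this mail (messages are handed out strictly in seqno order, starting at the initial seqno).

```java
import java.util.TreeMap;

// Hypothetical, simplified stand-in for AckReceiverWindow (not the real JGroups class).
class SimpleReceiverWindow {
    private final TreeMap<Long, String> msgs = new TreeMap<>();
    private long nextToRemove; // seqno of the next message that may be delivered

    SimpleReceiverWindow(long initialSeqno) {
        this.nextToRemove = initialSeqno;
    }

    // Buffer a message unless its seqno was already delivered.
    void add(long seqno, String msg) {
        if (seqno >= nextToRemove) {
            msgs.putIfAbsent(seqno, msg);
        }
    }

    // Return the message with seqno == nextToRemove, or null if it has not arrived.
    String remove() {
        String msg = msgs.remove(nextToRemove);
        if (msg != null) {
            nextToRemove++;
        }
        return msg;
    }
}
```

With a window created at seqno 1, adding messages 42 and 43 and then calling remove() yields null on every call: the window keeps waiting for seqno 1, which the sender will never transmit.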
Our quick fix, which appears to work fine, is to change UNICAST line 453 as
follows (I am not sure about potential bugs introduced by this):
entry.received_msgs=new AckReceiverWindow(seqno);
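The effect of this one-line change can be illustrated with a small model. This is a sketch under stated assumptions rather than the actual UNICAST code: WindowSketch and ReceiverSketch are hypothetical names, and the window is assumed to deliver messages strictly in seqno order starting from its initial seqno.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical, simplified model of a receiver window (not JGroups' AckReceiverWindow).
class WindowSketch {
    private final TreeMap<Long, String> msgs = new TreeMap<>();
    private long nextToRemove;

    WindowSketch(long initialSeqno) {
        this.nextToRemove = initialSeqno;
    }

    void add(long seqno, String msg) {
        if (seqno >= nextToRemove) {
            msgs.putIfAbsent(seqno, msg);
        }
    }

    String remove() {
        String msg = msgs.remove(nextToRemove);
        if (msg != null) {
            nextToRemove++;
        }
        return msg;
    }
}

// Hypothetical receiver that mirrors the proposed quick fix.
class ReceiverSketch {
    private WindowSketch window; // null models a cleared connection table entry
    final List<String> delivered = new ArrayList<>();

    void receive(long seqno, String msg) {
        if (window == null) {
            // Quick fix: start the window at the first seqno seen, not at 1.
            window = new WindowSketch(seqno);
        }
        window.add(seqno, msg);
        String m;
        while ((m = window.remove()) != null) {
            delivered.add(m);
        }
    }
}
```

In this model, a receiver with no window that gets seqno 42 followed by 43 delivers both. One limitation of the sketch is worth noting: if 43 arrived first, the window would start at 43 and a later 42 would be discarded as already delivered, which may be the kind of side effect the reporter is unsure about.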
- relates to
- JGRP-505 UNICAST: sequence numbering after merge leads to messages never being delivered
- Resolved