Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 3.6
Affects Version/s: None
Labels: None


In UNICAST2, if we have a connection between sender A and receiver B, and B closes the connection (but not A), then A can end up with missing messages in its send table.
Example:

A sends messages to B
A has an entry for B in its send-table: B: 10|20 (lowest sent=10, highest sent=20)
B has an entry for A in its recv-table: A: 10|20 (lowest received=10, highest received=20)
B now gets a view that doesn't contain A and closes its connection to A
- This results in B's connection to A getting removed
A now sends message A::21
B doesn't find an entry in its recv-table for A and sends GET-FIRST-SEQNO to A
A receives the request and sends messages A::11 (marked as first) through A::21 to B. These messages are sent unreliably, so they can get dropped. Let's assume (for this example) that some of them are dropped.
B does receive A::11 (the first message) and creates an entry for A in its recv-table: A: 11|21 (next to be received is A::12)
Now a spurious STABLE(A::15) message from B is received by A
- This can happen when B sent the STABLE message before its connection to A was removed, but the message was delayed, e.g. by garbage collection
- Note that the connection ID (conn-id) is the same, so A will not reject the STABLE message from B
A receives the STABLE message and purges elements up to 15, so its new entry for B is: B: 15|21
When B asks A for retransmission of messages A::12 - A::21, A can only retransmit A::16 - A::21, but not A::12 - A::15!
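
The effect of the delayed STABLE message on A's send table can be reproduced with a small model. This is only a sketch: SendWindowSketch and its methods are hypothetical names made up for illustration, not the JGroups Table class, and it assumes a purge removes every message up to and including the given seqno.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical stand-in for A's send-table entry for B (not the JGroups Table class).
class SendWindowSketch {
    private final TreeMap<Long, String> msgs = new TreeMap<>();
    private long low; // highest seqno purged so far

    void add(long seqno, String payload) { msgs.put(seqno, payload); }

    // STABLE(seqno): drop everything up to and including seqno (assumed semantics).
    void purge(long seqno) {
        msgs.headMap(seqno, true).clear();
        low = Math.max(low, seqno);
    }

    // Returns the seqnos in [from..to] that can no longer be retransmitted.
    List<Long> missing(long from, long to) {
        List<Long> result = new ArrayList<>();
        for (long s = from; s <= to; s++)
            if (!msgs.containsKey(s)) result.add(s);
        return result;
    }

    long low() { return low; }

    public static void main(String[] args) {
        SendWindowSketch a = new SendWindowSketch();
        for (long s = 11; s <= 21; s++) a.add(s, "msg-" + s); // A::11 .. A::21
        a.purge(15);                                          // delayed STABLE(A::15)
        // B asks for A::12 .. A::21
        System.out.println("cannot retransmit: " + a.missing(12, 21)); // [12, 13, 14, 15]
    }
}
```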

Depending on which of the messages that A sent unreliably in response to GET-FIRST-SEQNO were actually received by B, B keeps sending retransmission requests to A for some or all of the messages in A[12..15], and these requests never end, e.g.

WARN  [org.jgroups.protocols.UNICAST2] A: (requester=B) message B::13 not found in 
retransmission table of B: [15 | 15 | 22] (X elements, Y missing)

Reordering of STABLE messages

In the worst case, because STABLE messages are not sent reliably and can therefore get dropped or reordered, A may get another STABLE(10) message after the STABLE(15) message. The error message above would then look like this:

WARN  [org.jgroups.protocols.UNICAST2] A: (requester=B) message B::13 not found in
retransmission table of B: [10 | 10 | 22] (X elements, Y missing)

Note that, with https://issues.jboss.org/browse/JGRP-1872 fixed, this cannot occur anymore.
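
One way to picture why a stale STABLE message no longer moves the table backwards is a purge that ignores seqnos at or below the current low mark. The sketch below is an assumption for illustration, not the actual JGRP-1872 patch: MonotonicPurgeSketch is a hypothetical class, and its low, hd and hr fields are only assumed to correspond to the three numbers printed in the log lines above.

```java
// Hypothetical purge guard; not the code from the JGRP-1872 fix.
class MonotonicPurgeSketch {
    private long low; // lowest seqno, everything up to it has been purged (assumed meaning)
    private long hd;  // second number in the log output (assumed meaning)
    private final long hr; // third number in the log output (assumed meaning)

    MonotonicPurgeSketch(long low, long hd, long hr) {
        this.low = low;
        this.hd = hd;
        this.hr = hr;
    }

    // Returns true if the purge was applied, false if it was stale and ignored.
    boolean purge(long seqno) {
        seqno = Math.min(seqno, hr);     // never purge beyond what is in the table
        if (seqno <= low) return false;  // stale or reordered STABLE: do not move back
        low = seqno;
        if (hd < low) hd = low;          // keep hd from falling behind low
        return true;
    }

    long low() { return low; }

    @Override
    public String toString() { return "[" + low + " | " + hd + " | " + hr + "]"; }

    public static void main(String[] args) {
        MonotonicPurgeSketch t = new MonotonicPurgeSketch(10, 10, 22);
        t.purge(15);
        System.out.println(t); // [15 | 15 | 22]
        t.purge(10);           // reordered STABLE(10) is ignored
        System.out.println(t); // [15 | 15 | 22]
    }
}
```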

Solution

There's no real solution but to upgrade to UNICAST3: when UNICAST3 receives a view, it doesn't remove receive (and send) connections immediately, but merely marks them as closed. The connection will only be removed after conn_close_timeout ms. If B therefore gets further messages from A, it will simply mark the receive connection as open and doesn't need to send a GET-FIRST-SEQNO message to A as it still has all of A's messages.
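
The mark-as-closed behavior described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not UNICAST3's real implementation: DelayedCloseSketch, its Entry type and its method names are hypothetical, only conn_close_timeout comes from the description, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a receive table with delayed removal (not UNICAST3's classes).
class DelayedCloseSketch {
    enum State { OPEN, CLOSING }

    static final class Entry {
        State state = State.OPEN;
        long closedAt; // time (ms) at which the entry was marked CLOSING
    }

    private final Map<String, Entry> recvTable = new HashMap<>();
    private final long connCloseTimeoutMs; // plays the role of conn_close_timeout

    DelayedCloseSketch(long connCloseTimeoutMs) { this.connCloseTimeoutMs = connCloseTimeoutMs; }

    void open(String sender) { recvTable.put(sender, new Entry()); }

    // A view without 'sender' arrived: mark the entry, do not remove it yet.
    void onViewWithout(String sender, long now) {
        Entry e = recvTable.get(sender);
        if (e != null) { e.state = State.CLOSING; e.closedAt = now; }
    }

    // Returns true if the receiver lost its state and would need GET-FIRST-SEQNO.
    boolean onMessage(String sender) {
        Entry e = recvTable.get(sender);
        if (e == null) return true;
        e.state = State.OPEN; // reopen: all of the sender's messages are still known
        return false;
    }

    // Periodic task: drop entries that stayed CLOSING for the whole timeout.
    void reap(long now) {
        recvTable.values().removeIf(
            e -> e.state == State.CLOSING && now - e.closedAt >= connCloseTimeoutMs);
    }

    public static void main(String[] args) {
        DelayedCloseSketch b = new DelayedCloseSketch(10_000); // timeout value is arbitrary
        b.open("A");
        b.onViewWithout("A", 0);
        b.reap(1_000);                          // too early, entry survives
        System.out.println(b.onMessage("A"));   // false
    }
}
```

In this model a late message from A within the timeout simply reopens the entry, which is the property that avoids the GET-FIRST-SEQNO round trip in the scenario above.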

We could think of a connection establishment and teardown protocol used by all of the unicast protocols, which establishes connections similarly to TCP. Senders would block until a connection is established, new conn-ids would be created, and the current send and receive seqnos would be exchanged. This could also be used as a second line of defense, to re-establish the connection when a sender doesn't find messages requested for retransmission by a receiver. As an alternative, we could create a new protocol which syncs a receive table with a sender, e.g. https://issues.jboss.org/browse/JGRP-1875.

To mitigate the above issue, FD_ALL rather than FD should be used, so that members suspect each other more or less at the same time. This is not the case with FD, where N hung (or GC'ing) members take N * timeout to be suspected. With FD_ALL, chances are that A and B suspect each other and later both establish a new connection.
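
The timing difference can be made concrete with a little arithmetic. The sketch below uses a deliberately simplified model taken from the paragraph above (FD: N * timeout, FD_ALL: roughly one timeout); SuspicionTimeSketch and its methods are hypothetical helpers, and the numbers are examples only, not recommended settings.

```java
// Simplified model of suspicion times; real behavior depends on the full FD / FD_ALL configuration.
class SuspicionTimeSketch {
    // FD: hung members are suspected one after another, so N members take N * timeout.
    static long fdWorstCaseMs(int hungMembers, long timeoutMs) {
        return hungMembers * timeoutMs;
    }

    // FD_ALL: all members are checked against the same timeout, so roughly one timeout in total.
    static long fdAllApproxMs(int hungMembers, long timeoutMs) {
        return hungMembers > 0 ? timeoutMs : 0;
    }

    public static void main(String[] args) {
        int hung = 3;           // example value
        long timeout = 10_000;  // example value, in ms
        System.out.println("FD:     " + fdWorstCaseMs(hung, timeout) + " ms"); // FD:     30000 ms
        System.out.println("FD_ALL: " + fdAllApproxMs(hung, timeout) + " ms"); // FD_ALL: 10000 ms
    }
}
```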

is related to

JGRP-1807 UNICAST: skipping of seqnos (Resolved)
JGRP-1872 Table: purge() with incorrect seqnos moves HD/LOW back (Resolved)
JGRP-1875 UNICAST3/UNICAST2: sync receiver table with sender table (Resolved)

Assignee: Bela Ban
Reporter: Bela Ban
Votes: 0
Watchers: 1
Created: 2014/08/25 9:33 AM
Updated: 2014/09/09 7:52 AM
Resolved: 2014/09/09 7:52 AM
