In UNICAST2, if we have a connection between sender A and receiver B, and B closes the connection (but not A), then A can end up with missing messages in its send table.
- A sends messages to B
- A has an entry for B in its send-table: B: 10|20 (lowest sent=10, highest sent=20)
- B has an entry for A in its recv-table: A: 10|20 (lowest received=10, highest received=20)
- B now gets a view that doesn't contain A and closes its connection to A
- This results in B's connection to A getting removed
- A now sends message A::21
- B doesn't find an entry in its recv-table for A and sends GET-FIRST-SEQNO to A
- A receives the request and sends messages A::11 (first) through A::21 to B. These messages are sent unreliably, so they can get dropped. Let's assume (for this example) that some of them are dropped.
- B does receive A::11 (first) and creates an entry for A in its recv-table: A: 11|21 (next to be received is A::12)
- Now a spurious STABLE(A::15) message by B is received by A
- This can happen when B sent the STABLE message before its connection to A was removed, but the message was delayed, e.g. by garbage collection
- Note that the connection ID (conn-id) is the same, so A will not reject the STABLE message from B
- A receives the STABLE message and purges elements up to 15, so its new entry for B is: B: 15|21
- When B asks A for retransmission of messages A::12 - A::21, A can only retransmit messages 16-21, but not A::12 - A::15!
Depending on which of the messages that A sent unreliably in response to GET-FIRST-SEQNO were actually received by B, B would send never-ending retransmission requests to A for some or all of the messages in A[12..15], e.g.
WARN [org.jgroups.protocols.UNICAST2] A: (requester=B) message B::13 not found in retransmission table of B: [15 | 15 | 22] (X elements, Y missing)
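The purge step in the scenario above can be reproduced with a small model. This is only a sketch under stated assumptions: SendTableSketch and its methods are hypothetical names made up for illustration and are not part of the JGroups API. It only shows why a stale STABLE(15) leaves A unable to serve seqnos 12 to 15.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical, simplified model of A's send table for B (not the JGroups API).
final class SendTableSketch {
    // seqno -> message payload, ordered by seqno
    private final TreeMap<Long, String> msgs = new TreeMap<>();

    // A sends a message and keeps it for possible retransmission.
    void add(long seqno) {
        msgs.put(seqno, "A::" + seqno);
    }

    // STABLE(seqno): purge everything up to and including seqno.
    void stable(long seqno) {
        msgs.headMap(seqno, true).clear();
    }

    // Seqnos in [from..to] that can no longer be retransmitted.
    List<Long> missing(long from, long to) {
        List<Long> result = new ArrayList<>();
        for (long s = from; s <= to; s++) {
            if (!msgs.containsKey(s)) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        SendTableSketch a = new SendTableSketch();
        for (long s = 11; s <= 21; s++) {
            a.add(s);                          // A holds A::11 .. A::21
        }
        a.stable(15);                          // delayed STABLE(A::15) from B's old connection
        System.out.println(a.missing(12, 21)); // B asks for A::12 .. A::21
    }
}
```

In this model the request for 12 to 21 reports 12, 13, 14 and 15 as unavailable, which corresponds to the retransmission requests that can never be satisfied.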
Reordering of STABLE messages
In the worst case, as STABLE messages are not sent reliably and can therefore get dropped or reordered, if A gets another STABLE(10) message after the STABLE(15) message, the error message above would look like this:
WARN [org.jgroups.protocols.UNICAST2] A: (requester=B) message B::13 not found in retransmission table of B: [10 | 10 | 22] (X elements, Y missing)
Note that, with https://issues.jboss.org/browse/JGRP-1872 fixed, this cannot occur anymore.
There's no real solution but to upgrade to UNICAST3: when UNICAST3 receives a view, it doesn't remove receive (and send) connections immediately, but merely marks them as closed. The connection will only be removed after conn_close_timeout ms. If B therefore gets further messages from A, it will simply mark the receive connection as open and doesn't need to send a GET-FIRST-SEQNO message to A as it still has all of A's messages.
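A minimal sketch of the "mark as closed, remove later" idea described above. All names here (DeferredCloseSketch, establish, onViewWithout, onMessage, reap) are hypothetical and chosen for illustration; this is not the UNICAST3 implementation, and time is passed in explicitly to keep the example deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a receive table that defers connection removal.
final class DeferredCloseSketch {
    enum State { OPEN, CLOSED }

    static final class Conn {
        State state = State.OPEN;
        long closedAt;          // time at which the connection was marked closed
        long highestReceived;   // highest seqno received from the sender
    }

    private final Map<String, Conn> recvTable = new HashMap<>();
    private final long connCloseTimeoutMs;

    DeferredCloseSketch(long connCloseTimeoutMs) {
        this.connCloseTimeoutMs = connCloseTimeoutMs;
    }

    // Creates the receive connection for a sender.
    void establish(String sender, long highestReceived) {
        Conn c = new Conn();
        c.highestReceived = highestReceived;
        recvTable.put(sender, c);
    }

    // A view without 'member' arrived: keep the entry, only mark it closed.
    void onViewWithout(String member, long now) {
        Conn c = recvTable.get(member);
        if (c != null && c.state == State.OPEN) {
            c.state = State.CLOSED;
            c.closedAt = now;
        }
    }

    // Returns true if existing state could be reused; false means the state is
    // gone and the receiver would have to ask the sender for its first seqno.
    boolean onMessage(String sender, long seqno, long now) {
        Conn c = recvTable.get(sender);
        if (c == null) {
            return false;
        }
        c.state = State.OPEN; // simply reopen the connection
        c.highestReceived = Math.max(c.highestReceived, seqno);
        return true;
    }

    // Periodic task: drop connections that have been closed for long enough.
    void reap(long now) {
        recvTable.values().removeIf(
            c -> c.state == State.CLOSED && now - c.closedAt >= connCloseTimeoutMs);
    }
}
```

With a timeout of 1000 ms, a message from A arriving 500 ms after the view reopens the existing entry, whereas a message arriving after the entry has been reaped finds no state and falls back to the first-seqno handshake.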
We could think of a connection establishment and teardown protocol used by all of the unicast protocols, which establishes connections similar to TCP. Senders would block until a connection is established, new conn-ids would be created, and the current send and receive seqnos would be exchanged. This could also be used as a second line of defense, to re-establish the connection when a sender doesn't find messages requested for retransmission by a receiver. As an alternative, we could create a new protocol which syncs a receive table with a sender, e.g. https://issues.jboss.org/browse/JGRP-1875.
To mitigate the above issue, FD_ALL rather than FD should be used, so that members suspect each other more or less at the same time. This is not the case with FD, where suspecting multiple hung (or GC'ing) members takes N * timeout. With FD_ALL, chances are that A and B suspect each other and later both establish a new connection.