-
Task
-
Resolution: Done
-
Major
-
2.2.8, 2.2.9, 2.2.9.1
-
None
Every message we send is tagged with a sequence number (seqno). When a member receives seqnos 1, 2 and 4 from a sender S, it asks S to retransmit 3.
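As a rough illustration of the receiver side, the sketch below tracks the seqnos received from a single sender and reports the gaps. The class GapDetector and its methods are hypothetical names made up for this example; this is not the NAKACK code, and it assumes seqnos start at 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical receiver-side helper for one sender S (not the JGroups code).
class GapDetector {
    private final TreeSet<Long> received = new TreeSet<>();
    private long highestDelivered = 0; // assumption: seqnos start at 1

    /** Records a received seqno and returns the seqnos still missing below the highest one seen. */
    List<Long> receive(long seqno) {
        received.add(seqno);
        // Deliver in order: advance only over a contiguous run of seqnos.
        while (received.contains(highestDelivered + 1)) {
            highestDelivered++;
        }
        // Everything between the delivered prefix and the highest seqno seen is a gap.
        List<Long> missing = new ArrayList<>();
        for (long s = highestDelivered + 1; s < received.last(); s++) {
            if (!received.contains(s)) {
                missing.add(s);
            }
        }
        return missing;
    }

    long highestDelivered() {
        return highestDelivered;
    }
}
```

Feeding it 1, 2 and 4 yields a retransmission request for 3, and nothing above 2 is delivered until 3 arrives.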
When S sends a message, it does the following:
1) Get a new seqno (prev seqno +1)
2) Attach the seqno to the message
3) Add the message to the retransmission table, in case it needs to be retransmitted
If an exception (e.g. OOM) happens after #1 (having incremented the seqno), but before successfully completing #3, then that message is lost: its seqno has been used up, but the message can never be retransmitted.
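The failure window can be reproduced with a simplified sender. This is only a sketch: UnsafeSender, sentMsgs and the failBeforeStore flag are hypothetical names for this example, and an IllegalStateException stands in for the OOM.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified sender showing the problem (not the JGroups code).
class UnsafeSender {
    private long seqno = 0;
    private final Map<Long, String> sentMsgs = new HashMap<>(); // retransmission table
    boolean failBeforeStore = false; // test hook standing in for an OOM

    long send(String payload) {
        long msgSeqno = ++seqno;                   // step 1: seqno is already incremented
        String tagged = msgSeqno + ":" + payload;  // step 2: attach seqno to the message
        if (failBeforeStore) {
            // The exception escapes here, but the incremented seqno is kept.
            throw new IllegalStateException("simulated failure before step 3");
        }
        sentMsgs.put(msgSeqno, tagged);            // step 3: add to retransmission table
        return msgSeqno;
    }

    boolean canRetransmit(long s) {
        return sentMsgs.containsKey(s);
    }
}
```

If one send() fails with the flag set and the next one succeeds, the successful message carries a seqno one higher than expected, and canRetransmit() returns false for the skipped seqno.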
Example:
- S sends a message
  - Seqno is 2, so we increment it to 3
  - Header with seqno=3 is added to the message
  - OOM occurs! Note that the message is not in the retransmission table
  - OOM is propagated up to the caller (Channel.send())
- Next message is sent
  - Seqno is 3, so we increment it to 4
  - Header with seqno=4 is added to the message
  - Message is added to the retransmission table
- Receiver gets msg with seqno=4, but the last message was 2, so the receiver asks S for retransmission of 3
- S doesn't find 3; the error message looks like:
2006-01-06 14:51:58,093 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.160.175:1025, local_addr=192.168.160.174:1025) message 192.168.160.174:1025::460 not found in sent msgs.
Sent messages: [394 - 504] (108)
- Receiver will never deliver messages higher than 2 from S!
SOLUTION:
- In NAKACK.send()
- The increment of the seqno and the addition to the retransmission table have to be done atomically: if there is an exception, the seqno must not be incremented!
- Note that if there is an exception after adding the message and incrementing the seqno, e.g. when passing the message down, we don't care, because that message can now successfully be retransmitted
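One possible way to get this behaviour is to compute the next seqno tentatively and only commit it once the message is in the retransmission table. The sketch below is not the actual NAKACK.send() change; SafeSender, sentMsgs, passDown() and the failBeforeStore flag are hypothetical names, and a synchronized method is assumed to be sufficient for mutual exclusion.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sender with an all-or-nothing seqno update (not the JGroups code).
class SafeSender {
    private long seqno = 0;
    private final Map<Long, String> sentMsgs = new HashMap<>(); // retransmission table
    boolean failBeforeStore = false; // test hook standing in for an OOM

    synchronized long send(String payload) {
        long next = seqno + 1;                 // tentative seqno, not yet committed
        String tagged = next + ":" + payload;  // attach seqno to the message
        if (failBeforeStore) {
            // seqno is still unchanged, so the next send() reuses 'next'.
            throw new IllegalStateException("simulated failure before storing");
        }
        sentMsgs.put(next, tagged);            // may throw; seqno still unchanged
        seqno = next;                          // commit only after a successful store
        passDown(tagged);                      // a failure here is harmless: the
                                               // message can be retransmitted
        return next;
    }

    // Placeholder for handing the message to the layer below.
    private void passDown(String msg) {
    }

    boolean canRetransmit(long s) {
        return sentMsgs.containsKey(s);
    }
}
```

With this ordering, a failed send() leaves the seqno where it was, so the following message reuses the same seqno and the receiver never sees a gap that the sender cannot fill.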