Type: Feature Request
Resolution: Done
Priority: Major
Multicasts are flow-controlled; retransmissions, however, are not, because MFC sits above NAKACK2 in the stack. This causes problems when there are many message drops: 'retransmission storms' can overwhelm the switch and receiver queues and generate more traffic than the original messages did, leading to even more drops.
Placing MFC below NAKACK2 also leads to problems:
- When both original and retransmitted messages block on 0 credits in MFC, the thread pool will soon be exhausted by retransmission requests.
- If we tag retransmitted messages as DONT_BLOCK, then retransmitted messages will get dropped by MFC on 0 credits. This favoring of original messages over retransmissions leads to ever-widening xmit windows on the receivers, eventually causing memory exhaustion.
The xmit window (implemented by Table) can widen because it is not fixed-size; it expands and shrinks dynamically.
We therefore need a fixed-size xmit window, which blocks senders when adding messages if there's not enough space. Enter NAKACK4:
NAKACK4
NAKACK4 has fixed-size sender and receiver windows (RingBufferSeqno). Conceptually, every member has 1 sender window, plus 1 receiver window per cluster member.
A window has space for a max number of messages (capacity), with a low and high index:
- On the sender, low = highest acked and high = highest sent
- On the receiver, low = highest delivered and high = highest received
The sender increments a seqno and adds the message to its sender window. If there is not enough space, the send blocks until there is.
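To illustrate, here is a minimal sketch of such a fixed-size window with a blocking add. The class name FixedSizeWindow and its methods are hypothetical, not the actual RingBufferSeqno API:
{code:java}
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch of a fixed-size seqno window. Seqnos start at 1;
// the window holds seqnos in the range (low, low+capacity].
public class FixedSizeWindow<T> {
    private final T[] buf;
    private long low;   // sender: highest acked;    receiver: highest delivered
    private long high;  // sender: highest sent;     receiver: highest received
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition spaceAvailable = lock.newCondition();

    @SuppressWarnings("unchecked")
    public FixedSizeWindow(int capacity) {
        buf = (T[]) new Object[capacity];
    }

    /** Adds msg at seqno, blocking while the window has no slot for it. */
    public void add(long seqno, T msg) throws InterruptedException {
        lock.lock();
        try {
            while (seqno - low > buf.length)   // window full for this seqno
                spaceAvailable.await();
            buf[(int) ((seqno - 1) % buf.length)] = msg;
            high = Math.max(high, seqno);
        } finally {
            lock.unlock();
        }
    }

    /** Advances low to newLow (all seqnos <= newLow acked), freeing slots. */
    public void purge(long newLow) {
        lock.lock();
        try {
            for (long s = low + 1; s <= newLow; s++)
                buf[(int) ((s - 1) % buf.length)] = null;
            low = Math.max(low, newLow);
            spaceAvailable.signalAll();        // wake blocked senders
        } finally {
            lock.unlock();
        }
    }
}
{code}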
When a receiver receives a message, it adds it to the receiver window (dropping it if the seqno is out of range), then delivers as many messages as possible without a gap, increasing the low index. Finally, it sends an ack back to the sender.
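A receiver-side sketch (again with hypothetical names) of that add / deliver / ack sequence:
{code:java}
// Hypothetical receiver window: drops out-of-range seqnos, delivers the
// contiguous prefix, then acks the highest delivered seqno.
public class ReceiverWindowSketch {
    private final Object[] buf;
    private long low;   // highest delivered
    private long high;  // highest received

    public ReceiverWindowSketch(int capacity) { buf = new Object[capacity]; }

    public synchronized void onMessage(long seqno, Object msg) {
        if (seqno <= low || seqno > low + buf.length)
            return;                               // duplicate or out of range: drop
        buf[(int) ((seqno - 1) % buf.length)] = msg;
        high = Math.max(high, seqno);
        // deliver as many contiguous messages as possible, advancing low
        while (low < high && buf[(int) (low % buf.length)] != null) {
            Object next = buf[(int) (low % buf.length)];
            buf[(int) (low % buf.length)] = null;
            low++;
            deliver(next);
        }
        sendAck(low);                             // tell the sender what was delivered
    }

    private void deliver(Object msg) { /* pass up the stack */ }
    private void sendAck(long seqno) { /* send ACK(seqno) to the sender */ }
}
{code}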
When the sender has acks from all receivers (cluster members), it computes the minimum and advances the low index. This unblocks blocked senders, which may now be able to add their messages to the send window and send them.
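The ack bookkeeping on the sender could look like the following sketch (hypothetical names); the returned minimum would be passed to the window's purge method to free slots and wake blocked senders:
{code:java}
import java.util.Collection;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical ack table: tracks the highest ack per member; the minimum
// across all members is the highest seqno delivered by everyone.
public class AckTableSketch {
    private final Map<String, Long> highestAcked = new ConcurrentHashMap<>();

    /** Seeds every current member at 0, e.g. on a view change, so the
     *  minimum waits until everyone has acked. */
    public void reset(Collection<String> members) {
        highestAcked.clear();
        members.forEach(m -> highestAcked.put(m, 0L));
    }

    /** Records ACK(seqno) from member; returns the new stable minimum. */
    public long onAck(String member, long seqno) {
        highestAcked.merge(member, seqno, Math::max);
        return highestAcked.values().stream()
            .mapToLong(Long::longValue).min().orElse(0);
    }
}
{code}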
The receiver sends acks either after a number of messages have been received, or periodically (xmit_interval). It also periodically sends retransmission requests if it detects missing messages.
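A sketch of that ack-sending policy; the threshold of 100 messages and the 500 ms interval are illustrative values, not defaults from the source:
{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical ack scheduler: acks after every ACK_THRESHOLD messages,
// with a periodic flush so the tail of a burst gets acked too.
public class AckSchedulerSketch {
    static final int  ACK_THRESHOLD = 100;  // assumed batch size
    static final long XMIT_INTERVAL = 500;  // ms, periodic flush interval

    private final AtomicLong delivered = new AtomicLong();
    private final AtomicBoolean ackPending = new AtomicBoolean();
    private final ScheduledExecutorService timer =
        Executors.newSingleThreadScheduledExecutor();

    public AckSchedulerSketch() {
        timer.scheduleAtFixedRate(this::flush, XMIT_INTERVAL, XMIT_INTERVAL,
                                  TimeUnit.MILLISECONDS);
    }

    /** Called after each delivered message. */
    public void onDelivered() {
        ackPending.set(true);
        if (delivered.incrementAndGet() % ACK_THRESHOLD == 0)
            flush();
    }

    private void flush() {
        if (ackPending.compareAndSet(true, false))
            sendAck();
    }

    private void sendAck() { /* send ACK(highest delivered seqno) to the sender */ }
}
{code}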
Because the number of messages in transit cannot be higher than the number of senders times the window capacity, we get natural flow control over both original and retransmitted messages. For example, with a window capacity of 2000 messages:
- If we have a cluster {A,B,C,D} and only A is sending, the max number of messages in transit (and stored at every member) is 2000.
- If all 4 members are sending, it is 4 * 2000 = 8000.
We therefore don't need any flow control protocol (MFC) anymore.
We also won't need STABLE any longer, because message stability is provided via ACKs from receivers to senders.
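A minimal programmatic stack sketch showing NAKACK4 in place of NAKACK2 + MFC + STABLE. This assumes NAKACK4 lives alongside NAKACK2 in org.jgroups.protocols.pbcast; a production stack would need more protocols (failure detection, merging, etc.):
{code:java}
import org.jgroups.JChannel;
import org.jgroups.protocols.PING;
import org.jgroups.protocols.UDP;
import org.jgroups.protocols.UNICAST3;
import org.jgroups.protocols.pbcast.GMS;
import org.jgroups.protocols.pbcast.NAKACK4;  // assumed package, matching NAKACK2

public class Nakack4StackSketch {
    public static void main(String[] args) throws Exception {
        try (JChannel ch = new JChannel(
                new UDP(),
                new PING(),
                new NAKACK4(),   // replaces NAKACK2; no MFC, no STABLE
                new UNICAST3(),
                new GMS())) {
            ch.connect("demo-cluster");
            ch.send(null, "hello");  // null destination == multicast to the cluster
        }
    }
}
{code}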
Main differences to NAKACK2
- Fixed-size retransmit window
- ACKs sent by receivers to senders -> unblocking of senders and message stability
New protocols NAKACK3 and NAKACK4
Because we don't want to introduce incompatibilities (or bugs) into NAKACK2 by making it extend a new base class ReliableMulticast, NAKACK2 will be left unchanged. Instead:
- NAKACK2 is copied to NAKACK3
- NAKACK3 extends ReliableMulticast, and most of its functionality will be moved to that new parent class
- NAKACK4 also extends ReliableMulticast
Caveats
Because this is a design based on acks rather than nacks (as in NAKACK2), it will not scale to hundreds of cluster members. However, note that the number of acks sent can be reduced, e.g. by sending them only every xmit_interval, by sending them on every Nth message, or (possibly) by piggybacking them on outgoing messages.
Misc
- RingBufferSeqno could also be used by UNICAST3 instead of Table. However, because most unicast-based applications use TCP (which flow-controls both original and retransmitted messages) rather than UDP, the problem is not pressing. We can look at this later, possibly in a separate JIRA issue.
- The design is described in ./doc/design/NAKACK4.txt
Related issues:
- JGRP-2140
- JGRP-2817: Asynchronous messages with slow receiver runs out of memory (Resolved)