  JGroups / JGRP-2780

MCAST: multicast protocol based on positive acks


    • Type: Feature Request
    • Resolution: Unresolved
    • Priority: Major
    • 5.4

      Multicasts are flow-controlled; retransmissions, however, are not, because MFC sits above NAKACK2. This causes problems when there are many message drops: 'retransmission storms' can overwhelm the switch / receiver queues and generate more traffic than the original messages, leading to even more drops.

      Placing MFC below NAKACK2 also leads to problems:

    • When both original and retransmitted messages block on 0 credits in MFC, the thread pool will soon be exhausted by retransmission requests.
    • If we tag retransmitted messages as DONT_BLOCK, they will be dropped by MFC on 0 credits. This favoring of original messages over retransmissions leads to ever-widening xmit windows on the receivers, eventually causing memory exhaustion.

      The xmit window (implemented by Table) can widen because it is not of fixed size; it expands and shrinks dynamically.

      We therefore need a fixed-size xmit window, which blocks senders when adding messages if there's not enough space. Enter MCAST:

      MCAST

      MCAST has fixed-size sender and receiver windows (RingBufferSeqno). Conceptually, every member has 1 sender window, plus 1 receiver window per cluster member.

      A window has space for a maximum number of messages (its capacity) and maintains a low and a high index:

      • On the sender, low = highest acked and high = highest sent
      • On the receiver, low = highest delivered and high = highest received

      The sender increments a seqno and adds the message to the sender window. If there's not enough space, the send blocks until there is.
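
      The following is a minimal sketch of such a fixed-capacity window, assuming a simple wait/notify design. The class and method names (FixedWindow, add(), purge()) are illustrative only, not the actual RingBufferSeqno API:

      // Illustrative sketch, not the real RingBufferSeqno
      public class FixedWindow<T> {
          private final T[] buf;  // ring buffer with 'capacity' slots
          private long low;       // sender: highest acked;  receiver: highest delivered
          private long high;      // sender: highest sent;   receiver: highest received

          @SuppressWarnings("unchecked")
          public FixedWindow(int capacity, long offset) {
              buf=(T[])new Object[capacity];
              low=high=offset;
          }

          /** Adds msg at seqno; blocks while the window is full (seqno - low > capacity) */
          public synchronized void add(long seqno, T msg) throws InterruptedException {
              while(seqno - low > buf.length)
                  wait();                              // woken up by purge() when low advances
              buf[(int)(seqno % buf.length)]=msg;
              if(seqno > high)
                  high=seqno;
          }

          /** Advances low (e.g. to the min ack across all receivers) and unblocks waiting adders */
          public synchronized void purge(long new_low) {
              if(new_low <= low)
                  return;
              for(long s=low+1; s <= new_low; s++)
                  buf[(int)(s % buf.length)]=null;     // free the slots of acked/delivered messages
              low=new_low;
              notifyAll();
          }
      }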

      When a receiver receives a message, it adds it to its receiver window (dropping it if the seqno is out of range), then delivers as many messages as possible without a gap, advancing the low index. Finally, it sends an ack back to the sender.
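
      A sketch of the receive path, with the same caveat: receive(), deliver() and sendAck() are placeholder names, seqnos are assumed to start after offset 0, and acking every 100th message is just an example policy (acks are also sent every xmit_interval):

      // Receiver-side sketch; deliver() and sendAck() are placeholders, not JGroups API
      public class ReceiverWindow {
          private final Object[] buf;
          private long low;        // highest delivered
          private long high;       // highest received
          private long received;   // used to send an ack every Nth message

          public ReceiverWindow(int capacity) {buf=new Object[capacity];}

          public synchronized void receive(long seqno, Object msg) {
              if(seqno <= low || seqno - low > buf.length)
                  return;                              // out of range: already delivered or beyond capacity
              buf[(int)(seqno % buf.length)]=msg;
              if(seqno > high)
                  high=seqno;

              // deliver as many consecutive messages as possible, advancing low
              while(low < high && buf[(int)((low+1) % buf.length)] != null) {
                  long next=low+1;
                  deliver(buf[(int)(next % buf.length)]);
                  buf[(int)(next % buf.length)]=null;
                  low=next;
              }

              if(++received % 100 == 0)                // example: ack every 100th message
                  sendAck(low);                        // ack the highest delivered seqno
          }

          protected void deliver(Object msg)             {/* pass the message up the stack */}
          protected void sendAck(long highest_delivered) {/* send ACK(highest_delivered) to the sender */}
      }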

      When the sender has acks from all receivers (cluster members), it computes the minimum acked seqno and advances the low index. This unblocks blocked senders, which may now be able to add their messages to the send window and send them.
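
      How the sender might aggregate acks is sketched below; the AckTable name and the seeding of all members with 0 are assumptions, not the actual implementation:

      // Sender-side ack bookkeeping sketch
      import java.util.Collection;
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      public class AckTable {
          private final Map<String,Long> acks=new ConcurrentHashMap<>(); // member -> highest acked seqno

          public AckTable(Collection<String> members) {
              members.forEach(m -> acks.put(m, 0L));   // seed every member so min() reflects all receivers
          }

          /** Records an ack and returns the minimum acked seqno across all members */
          public long ack(String member, long seqno) {
              acks.merge(member, seqno, Math::max);    // acks may arrive out of order
              return acks.values().stream().mapToLong(Long::longValue).min().orElse(0);
          }
      }

      The sender would then call something like purge(min) on its send window (as in the window sketch above): low advances to min, and any threads blocked in add() are woken up.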

      The receiver sends acks either after a number of messages have been received, or periodically (xmit_interval). It also periodically sends retransmission requests if it detects missing messages.

      Because the number of messages in transit cannot exceed the number of senders times the window capacity, we get natural flow control over both original and retransmitted messages. For example:

    • Assume a window capacity of 2000 messages.
    • If we have a cluster {A,B,C,D} and only A is sending, the max number of messages in transit (and stored at every member) is 2000. If all members are sending, it is 8000 (see the small calculation below).
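
      As a tiny worked example of that bound (the class name is purely for illustration):

      public class InTransitBound {
          public static void main(String[] args) {
              int capacity=2000;                       // window capacity
              int senders=4;                           // cluster {A,B,C,D}, all members sending
              System.out.println(senders * capacity);  // 8000; with only A sending: 1 * 2000 = 2000
          }
      }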

      We therefore don't need any flow control protocol (MFC) anymore.

      Caveats

      Because this design is based on positive acks rather than negative acks (as in NAKACK2), it will not scale to hundreds of cluster members. However, the number of acks sent can be reduced, e.g. by sending them only every xmit_interval, by sending them on every Nth message, or by (possibly) piggybacking them on outgoing messages.

      Misc

      • RingBufferSeqno could also be used by UNICAST3 instead of Table. However, because most unicast-based applications use TCP (which flow controls both original and retransmitted messages) rather than UDP, the problem is not pressing. We can look at this later, possibly in a separate JIRA issue.
      • The design is described in ./doc/design/MCAST.txt

