JGroups / JGRP-179

Behavior of NAKACK and UNICAST in the presence of exceptions (e.g. OOM)

    • Type: Task
    • Resolution: Done
    • Priority: Major
    • 2.3
    • 2.2.8, 2.2.9, 2.2.9.1
    • None

      Every message we send is tagged with a sequence number (seqno). When a member receives seqnos 1, 2 and 4 from a sender S, it asks S to retransmit 3.
      When S sends a message, it does the following:
      1) Get a new seqno (prev seqno +1)
      2) Attach the seqno to the message
      3) Add the message to the retransmission table, in case it needs to be retransmitted
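
      The three steps above can be sketched roughly as follows. This is a hypothetical illustration, not the JGroups NAKACK code: the class NaiveSender, its fields, and the failNextSend flag (a test hook standing in for an exception such as an OOM between steps 1 and 3) are all assumptions. The starting seqno of 2 is arbitrary.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the JGroups NAKACK code.
class NaiveSender {
    long seqno = 2;                                      // last seqno handed out
    final Map<Long, String> sentMsgs = new HashMap<>();  // retransmission table
    boolean failNextSend = false;                        // test hook: simulate an OOM

    long send(String payload) {
        long next = ++seqno;                  // 1) get a new seqno (prev seqno + 1)
        String msg = next + ":" + payload;    // 2) attach the seqno to the message
        if (failNextSend) {                   // simulated failure between step 1 and step 3
            failNextSend = false;
            throw new IllegalStateException("simulated OOM");
        }
        sentMsgs.put(next, msg);              // 3) add to the retransmission table
        return next;
    }
}
```

      Note that step 1 mutates the counter before step 3 has run, so a failed send still consumes a seqno.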

      If an exception (e.g. OOM) happens after #1 (the seqno has already been incremented) but before #3 completes successfully, we lose a message: its seqno has been used up, but the message is not available for retransmission.
      Example:

      • S sends a message
      • Seqno is 2, so we'll increment it to 3
      • Header with seqno=3 is added to the message
      • OOM occurs! Note that the message is not in the retransmission table
      • OOM is propagated up to the caller (Channel.send())
      • Next message is sent
      • Seqno is 3, so we increment it to 4
      • Header with seqno=4 is added to the message
      • Message is added to the retransmission table
      • Receiver gets the message with seqno=4, but the last message it received was 2, so it asks S for retransmission of 3!
      • S doesn't find 3; the error message looks like:
        2006-01-06 14:51:58,093 ERROR [org.jgroups.protocols.pbcast.NAKACK] (requester=192.168.160.175:1025, local_addr=192.168.160.174:1025) message 192.168.160.174:1025::460 not found in sent msgs.
        Sent messages: [394 - 504] (108)
      • The receiver will never deliver messages with a seqno higher than 2 from S!
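
      To see why the receiver gets stuck, here is a minimal sketch of in-order delivery with gap detection. It is an illustration under assumptions, not the JGroups implementation: InOrderReceiver and all of its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical sketch of in-order delivery with gap detection; not the JGroups code.
class InOrderReceiver {
    long highestDelivered;                                   // last seqno delivered in order
    final TreeMap<Long, String> buffered = new TreeMap<>();  // received but not yet deliverable
    final List<String> delivered = new ArrayList<>();

    InOrderReceiver(long highestDelivered) { this.highestDelivered = highestDelivered; }

    // Buffers the message, delivers everything that is now contiguous,
    // and returns the seqnos that still have to be requested from the sender.
    List<Long> receive(long seqno, String payload) {
        if (seqno > highestDelivered) buffered.put(seqno, payload);
        while (buffered.containsKey(highestDelivered + 1)) {
            delivered.add(buffered.remove(++highestDelivered));
        }
        List<Long> missing = new ArrayList<>();
        if (!buffered.isEmpty()) {
            for (long s = highestDelivered + 1; s < buffered.lastKey(); s++) {
                if (!buffered.containsKey(s)) missing.add(s);
            }
        }
        return missing;
    }
}
```

      With highestDelivered at 2, every later message from S is only buffered and seqno 3 is requested again each time. Since S can never supply 3, nothing above 2 is delivered.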

      SOLUTION:

      • In NAKACK.send()
      • The increment of the seqno and the addition to the retransmission table have to be done atomically: if there is an exception, the seqno must not be incremented!
      • Note that an exception after the message has been added and the seqno incremented, e.g. when passing the message down, doesn't matter, because that message can now be retransmitted successfully.
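
      One possible shape of this fix is sketched below. It is a hypothetical illustration and not the actual change made to NAKACK.send(): AtomicSender, passDown, canRetransmit, and the failNextSend test hook are assumed names. The idea is to compute the seqno tentatively and commit it only after the message is in the retransmission table.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposed fix; not the actual NAKACK.send() change.
class AtomicSender {
    private long seqno = 2;                                      // last seqno successfully used
    private final Map<Long, String> sentMsgs = new HashMap<>();  // retransmission table
    boolean failNextSend = false;                                // test hook: simulate an OOM

    synchronized long send(String payload) {
        long next = seqno + 1;                // tentative seqno; the counter is untouched so far
        String msg = next + ":" + payload;    // attach the seqno to the message
        if (failNextSend) {                   // simulated failure before the table insert
            failNextSend = false;
            throw new IllegalStateException("simulated OOM");
        }
        sentMsgs.put(next, msg);              // may throw; seqno is still unchanged
        seqno = next;                         // commit only once the table holds the message
        passDown(msg);                        // a failure here is harmless: msg can be retransmitted
        return next;
    }

    void passDown(String msg) { /* placeholder for handing the message to the transport */ }

    synchronized boolean canRetransmit(long s) { return sentMsgs.containsKey(s); }
}
```

      If a send fails before the commit, the next send reuses the same seqno, so the receiver never sees a gap that S cannot fill.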

            rhn-engineering-bban Bela Ban