JGroups / JGRP-1807

UNICAST: skipping of seqnos

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • 3.2.13, 3.5

      The log starts with:
      10-Mar-2014 13:21:47 WARN  [org.jgroups.protocols.UNICAST2] (OOB-105,shared=tcp) node1/web: (requester=node2/web) message node2/web::1511786 not found in retransmission table of node2/web:
      [1511785 | 1511785 | 1511857] (53 elements, 19 missing)
      
      The "not found in retransmission table" warnings cover seqnos 1511786-1511804.
      
      And ends with:
      10-Mar-2014 14:48:26 WARN  [org.jgroups.protocols.UNICAST2] (OOB-118,shared=tcp) node1/web: (requester=node2/web) message node2/web::1511804 not found in retransmission table of node2/web:
      [1511785 | 1511785 | 1514802] (2998 elements, 19 missing) 
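
The two table summaries above can be cross-checked with simple arithmetic. The sketch below assumes the three bracketed numbers are low, highest delivered and highest received, which the report does not state. `DigestCheck` is a hypothetical helper, not part of JGroups.

```java
// Hypothetical helper, not part of JGroups. Assumes the bracketed fields in
// "[1511785 | 1511785 | 1511857] (53 elements, 19 missing)" mean
// low | highest delivered | highest received.
final class DigestCheck {
    /** Number of seqno slots in the window (low, highestReceived]. */
    static long slots(long low, long highestReceived) {
        return highestReceived - low;
    }

    /** True if the present and missing counts add up to the window size. */
    static boolean consistent(long low, long highestReceived, long elements, long missing) {
        return slots(low, highestReceived) == elements + missing;
    }

    public static void main(String[] args) {
        // First log entry: 1511857 - 1511785 = 72 = 53 + 19
        System.out.println(consistent(1511785L, 1511857L, 53, 19));
        // Last log entry: 1514802 - 1511785 = 3017 = 2998 + 19
        System.out.println(consistent(1511785L, 1514802L, 2998, 19));
    }
}
```

Under that reading, the window grows from 72 to 3017 slots between the first and last entry while the missing count stays at 19, which fits a fixed gap that retransmission never fills.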
      

      It seems that node1 is missing messages 1511786-1511804, which it sent to node2. Since a null message cannot be added to the sender table (the msg.isFlagSet() call would throw an NPE), I assume we're skipping a seqno:

      In down() of UNICAST, UNICAST2 and UNICAST3, if a seqno is skipped, we get endless retransmission requests. Example:

      • We get the next seqno 1, add the message to the table and send it
      • We get the next seqno 2. However, if running is false, we don't add the message
      • We get the next seqno 3. Now running is true, and we add 3 to the table
        --> Now we have a missing message 2 which will always be null as it hasn't been added to the table
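
The three steps above can be reproduced with a toy model. This is a simplified sketch, not the actual UNICAST code: `SkippedSeqnoDemo`, its `TreeMap` table and the `firstGap()` helper are all hypothetical stand-ins, and the only behaviour carried over from the description is that the seqno is consumed before `running` is checked.

```java
import java.util.TreeMap;

// Toy model of the sender side; names and data structures are hypothetical.
final class SkippedSeqnoDemo {
    private final TreeMap<Long, String> table = new TreeMap<>();
    private long nextSeqno = 1;
    volatile boolean running = true;

    /** Consumes a seqno first, then adds the message only if running is true. */
    long send(String payload) {
        long seqno = nextSeqno++;          // the seqno is always consumed
        while (running) {                  // body is skipped if running == false
            table.put(seqno, payload);
            break;
        }
        return seqno;
    }

    /** First seqno in [1, highest added] that is absent from the table, or -1. */
    long firstGap() {
        if (table.isEmpty()) return -1;
        for (long s = 1; s <= table.lastKey(); s++)
            if (!table.containsKey(s)) return s;
        return -1;
    }

    public static void main(String[] args) {
        SkippedSeqnoDemo demo = new SkippedSeqnoDemo();
        demo.send("m1");                   // seqno 1 is added
        demo.running = false;
        demo.send("m2");                   // seqno 2 is consumed but never added
        demo.running = true;
        demo.send("m3");                   // seqno 3 is added
        System.out.println(demo.firstGap()); // prints 2
    }
}
```

A receiver that sees 1 and 3 would keep asking for 2, and the sender has nothing to retransmit for it.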

      This is highly unlikely, as I haven't been able to find a scenario where running flips from true to false and back to true quickly. If it flips from true to false, this is because stop() has been called. Also, in down(), we actually check running and return if it is false.
      In that scenario, the connections are all removed, so seqno is reset to 1.
      Anyway, I'm going to replace the while(running) loop with a do-while(running) loop, so we always add the message to the table, even if running=false.
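
A minimal sketch of the difference between the two loop shapes, assuming a retry loop that breaks on success. `AddLoopSketch` and the `boolean[]` table are hypothetical stand-ins for the real send table, not the JGroups code.

```java
// Hypothetical stand-in for the add-to-table retry loop; not the JGroups code.
final class AddLoopSketch {
    /** Old shape: if running is already false, the body never executes. */
    static boolean addWithWhile(boolean running, boolean[] table, int seqno) {
        while (running) {
            try {
                table[seqno] = true;   // stand-in for adding the message
                break;                 // success: leave the loop
            } catch (RuntimeException e) {
                // retry only while still running
            }
        }
        return table[seqno];
    }

    /** New shape: the body executes at least once, whatever running is. */
    static boolean addWithDoWhile(boolean running, boolean[] table, int seqno) {
        do {
            try {
                table[seqno] = true;   // stand-in for adding the message
                break;                 // success: leave the loop
            } catch (RuntimeException e) {
                // retry only while still running
            }
        } while (running);
        return table[seqno];
    }
}
```

With running=false, `addWithWhile` leaves the slot empty while `addWithDoWhile` fills it, so a consumed seqno always has a matching table entry.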

      [1] https://github.com/belaban/JGroups/blob/Branch_JGroups_3_2/src/org/jgroups/protocols/UNICAST2.java#L490

              rhn-engineering-bban Bela Ban