JGRP-348

UNICAST: incorrect sequence numbers after merge if subgroups didn't completely exclude each other


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 2.5
    • Affects Version/s: 2.4
    • None
    • High

      Mail from David Foregt:
      Hi Bela,
      We still have an issue with UNICAST in JGroups 2.4 after applying your
      recommended settings. We spent more time analyzing the issue and found the
      exact scenario that causes the problem:

      • We have multiple nodes running on machine A (a1, a2, a3, a4...a15) and one
        node running on machine B (b1).
      • Node b1 is started first (and becomes coordinator), then all the a nodes
        are started. Once all nodes are active in the group, we disconnect
        machine A from the network.
      • After ~10 sec all a nodes see b1 as dead, a new view is propagated to all
        a nodes, and the connection table entry for b1 is cleared on every a node.
      • b1 starts seeing the a nodes as dead one by one, every ~10 sec (as defined
        by FD / VERIFY_SUSPECT). After 30 sec b1's view is (a4, a5...a15) and we
        reconnect the network cable on machine A. (At this point b1's connection
        table has been cleared only for a1...a3.)
      • After A reconnects to the network, a merge is done and all nodes are back
        in the group and able to exchange multicast messages.
      • Because b1 never detected a4...a15 as dead, its seqno for those nodes is
        not reset to 1 when it sends them a unicast message. When a4 receives the
        first unicast message from b1, its connection table entry for b1 has been
        cleared, so at line 453 of UNICAST it creates a new AckReceiverWindow with
        initial_seqno = 1 and adds the received message (which has a seqno > 1) to
        it. All subsequent unicast messages from b1 are added to this new
        AckReceiverWindow, and when remove is called at line 470 of UNICAST it
        always returns null, because AckReceiverWindow::next_to_remove is equal
        to 1 and every message added has a seqno > 1.
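
The stall described in the last bullet can be reproduced with a small model. The sketch below is not the JGroups AckReceiverWindow; `ReceiverWindowSketch` and `StalledWindowDemo` are hypothetical names, and the class only assumes what the report states: messages are delivered strictly in seqno order, starting from the initial seqno passed to the constructor. The starting seqno of 25 is an arbitrary illustration value.

```java
import java.util.TreeMap;

// Hypothetical, simplified stand-in for an in-order receiver window.
// It is NOT the JGroups AckReceiverWindow, only a model of the behavior in the report.
class ReceiverWindowSketch {
    private final TreeMap<Long, String> msgs = new TreeMap<>();
    private long nextToRemove;

    ReceiverWindowSketch(long initialSeqno) {
        this.nextToRemove = initialSeqno;
    }

    // Buffer a message unless its seqno is already behind the delivery point.
    void add(long seqno, String msg) {
        if (seqno >= nextToRemove) {
            msgs.putIfAbsent(seqno, msg);
        }
    }

    // Deliver strictly in order: null until the message with seqno == nextToRemove is present.
    String remove() {
        String msg = msgs.remove(nextToRemove);
        if (msg != null) {
            nextToRemove++;
        }
        return msg;
    }

    int buffered() {
        return msgs.size();
    }
}

class StalledWindowDemo {
    public static void main(String[] args) {
        // a4 cleared its entry for b1, so it starts a fresh window at seqno 1 ...
        ReceiverWindowSketch win = new ReceiverWindowSketch(1);
        // ... but b1 never reset its counter and keeps sending from (for example) 25.
        for (long seqno = 25; seqno <= 27; seqno++) {
            win.add(seqno, "msg-" + seqno);
            System.out.println("remove() after adding " + seqno + ": " + win.remove());
        }
        System.out.println("buffered but undeliverable: " + win.buffered());
    }
}
```

In this model every call to `remove()` returns null, because seqno 1 never arrives, and the buffered messages accumulate without being delivered.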

      The result is that a4...a15 can never again receive a unicast message
      from b1. This is reproducible every time.

      Our quick fix, which appears to work fine, is to change UNICAST line 453 as
      follows (I am not sure what potential bugs this might introduce):

      entry.received_msgs=new AckReceiverWindow(seqno);
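
To illustrate what seeding the window with the received seqno changes, here is a minimal sketch under the same assumptions as the report: strict in-order delivery starting from the constructor argument. `SeededWindowSketch` and `SeededWindowDemo` are hypothetical names, not JGroups classes, and the seqno values are arbitrary.

```java
import java.util.TreeMap;

// Hypothetical, simplified in-order receiver window (not the JGroups class).
class SeededWindowSketch {
    private final TreeMap<Long, String> msgs = new TreeMap<>();
    private long nextToRemove;

    SeededWindowSketch(long initialSeqno) {
        this.nextToRemove = initialSeqno;
    }

    // Buffer a message unless its seqno is already behind the delivery point.
    void add(long seqno, String msg) {
        if (seqno >= nextToRemove) {
            msgs.putIfAbsent(seqno, msg);
        }
    }

    // Deliver strictly in order: null until the message with seqno == nextToRemove is present.
    String remove() {
        String msg = msgs.remove(nextToRemove);
        if (msg != null) {
            nextToRemove++;
        }
        return msg;
    }
}

class SeededWindowDemo {
    public static void main(String[] args) {
        long firstSeqno = 25; // seqno of the first message seen after the merge (example value)
        // Mirrors the quick fix: start the window at the received seqno instead of 1.
        SeededWindowSketch win = new SeededWindowSketch(firstSeqno);
        for (long seqno = firstSeqno; seqno <= firstSeqno + 2; seqno++) {
            win.add(seqno, "msg-" + seqno);
            System.out.println("delivered: " + win.remove());
        }
    }
}
```

In this model delivery resumes immediately. One possible side effect, at least in this simplified model: if the first message to arrive is not the oldest one still in flight (say 26 arrives before 25), the window starts at 26 and the late message 25 is treated as already delivered and dropped.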

              rhn-engineering-bban Bela Ban