JGroups / JGRP-1450

Views go wrong when two members leave simultaneously


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version: 3.1
    • Affects Version: 3.0.9
    • None

      The test case is essentially the same as in JGRP-1443 and JGRP-1449, i.e. a group of 4 members, where I simultaneously kill two at random and let them restart, and expect the group to heal itself. To rule out SEQUENCER-related issues, I've removed SEQUENCER from the stack.

      I've got into a situation where:

      • members A, B, C see the same sequence of views and end up in a group [A, B, C]
      • but member D believes that the latest view is [C, D, A].

      I think I've identified the problem. First, here's the relevant trace from D (10.239.0.4; members A to D are 10.239.0.1 to 10.239.0.4):

      2012-04-15 10:47:37.910 [ViewHandler,TestCluster,10.239.0.4] DEBUG org.jgroups.protocols.pbcast.GMS - suspected members=[10.239.0.3], suspected_mbrs=[10.239.0.3]
      2012-04-15 10:47:37.961 [ViewHandler,TestCluster,10.239.0.4] DEBUG org.jgroups.protocols.pbcast.GMS - suspected members=[10.239.0.2], suspected_mbrs=[10.239.0.3, 10.239.0.2]
      2012-04-15 10:47:37.961 [ViewHandler,TestCluster,10.239.0.4] DEBUG org.jgroups.protocols.pbcast.GMS - members are [10.239.0.3, 10.239.0.2, 10.239.0.4, 10.239.0.1], coord=10.239.0.4: I'm the new coord !
      2012-04-15 10:47:38.011 [ViewHandler,TestCluster,10.239.0.4] TRACE org.jgroups.protocols.pbcast.GMS - 10.239.0.4: new members=[], suspected=[10.239.0.2], leaving=[], new view: [10.239.0.3|629] [10.239.0.3, 10.239.0.4, 10.239.0.1]
      2012-04-15 10:47:38.012 [ViewHandler,TestCluster,10.239.0.4] TRACE org.jgroups.protocols.pbcast.GMS - 10.239.0.4: mcasting view [10.239.0.3|629] [10.239.0.3, 10.239.0.4, 10.239.0.1] (3 mbrs)

      It looks to me as though D has received separate reports that C and B are suspected, and has correctly worked out that he will then be coordinator of a new group [D, A]. But when he actually becomes coordinator, he only remembers that B is suspected, and so sends out a bogus view that still contains C.
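
To make the first half of that concrete, here is a minimal sketch of how D could arrive at "I'm the new coord". It assumes the coordinator is simply the first member of the current list that is not suspected, which is consistent with the trace but is not taken from the GMS source. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: not JGroups code. Names are hypothetical.
final class CoordSketch {
    // Assumption: the new coordinator is the first member not in the suspect list.
    static String nextCoordinator(List<String> members, List<String> suspected) {
        for (String m : members) {
            if (!suspected.contains(m)) {
                return m;
            }
        }
        return null; // every member is suspected
    }

    public static void main(String[] args) {
        // Member list and accumulated suspects as logged by D at 10:47:37.961.
        List<String> members = Arrays.asList(
                "10.239.0.3", "10.239.0.2", "10.239.0.4", "10.239.0.1");
        List<String> suspectedMbrs = Arrays.asList("10.239.0.3", "10.239.0.2");
        System.out.println(nextCoordinator(members, suspectedMbrs)); // prints 10.239.0.4
    }
}
```

With both suspects taken into account, D (10.239.0.4) comes out as coordinator, matching the "coord=10.239.0.4" line in the trace.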

      If this is correct, I think the bug is in ParticipantGmsImpl.java, at the end of handleMembershipChange. I think the final loop should iterate over suspected_mbrs (before clearing that value) rather than over suspectedMembers.
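
The difference between the two loops can be sketched in isolation. This is not the code from ParticipantGmsImpl. It is a stand-alone illustration, with hypothetical class and method names, that assumes the new view is the member list minus whichever suspect collection the loop is given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: not JGroups code. Names are hypothetical.
final class SuspectViewSketch {
    // Returns the members left after removing the given suspects.
    static List<String> remaining(List<String> members, Collection<String> suspects) {
        List<String> view = new ArrayList<>(members);
        view.removeAll(suspects);
        return view;
    }

    public static void main(String[] args) {
        List<String> members = Arrays.asList(
                "10.239.0.3", "10.239.0.2", "10.239.0.4", "10.239.0.1");

        // Stand-in for suspected_mbrs: grows with every suspicion report.
        Set<String> accumulated = new LinkedHashSet<>();

        List<String> firstReport = Arrays.asList("10.239.0.3");
        accumulated.addAll(firstReport);

        // Stand-in for suspectedMembers: only the most recent report.
        List<String> latestReport = Arrays.asList("10.239.0.2");
        accumulated.addAll(latestReport);

        // Using only the latest report keeps 10.239.0.3 in the view.
        System.out.println(remaining(members, latestReport));
        // prints [10.239.0.3, 10.239.0.4, 10.239.0.1]

        // Using the accumulated set drops both suspects.
        System.out.println(remaining(members, accumulated));
        // prints [10.239.0.4, 10.239.0.1]
    }
}
```

The first result has the same membership as the view D multicast in the trace; the second is the [D, A] group that D should have formed.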

      Perhaps this is a bit speculative - you'll be able to tell me if I'm on the wrong track!

      I'll keep the full trace so that we can do further analysis if required; and I'll try out a fix along the lines outlined above.

              rhn-engineering-bban Bela Ban
              dimbleby David Hotham (Inactive)