  1. JGroups
  2. JGRP-1451

Group gets stuck with inconsistent views

    • Bug
    • Resolution: Done
    • Major
    • 3.1
    • 3.0.9
    • None

      Same stress test as in JGRP-1450 and related issues: start a group of four members, repeatedly kill two (picked at random), and expect the group to heal itself eventually.

      This one's a rather complicated sequence of events, if I've understood it correctly. I'll do my best to explain, but do ask if something's not clear or you'd like to see more details.

      • start with everyone agreeing that the view is [C, D, B, A]
      • kill C and D
      • On seeing this, A's FD_SOCK pinger tries but fails to connect to B
        • I think this is a race where previously D was monitoring B, and now A wants to monitor B
        • B hasn't yet spotted that D has gone, and so is not ready to accept a new connection from A
        • This is a bit of a guess, but I don't think this detail is critical.
      • So now A suspects everyone else and forms a view [A].
      • Meanwhile B only suspects C and D, so forms a view [B, A]
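
The monitoring hand-over guessed at above can be sketched with a toy model. It assumes that each member monitors the next member in the view, wrapping around at the end and skipping anyone it already suspects. This is a simplification for illustration, not the FD_SOCK code; the class and method names are hypothetical.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

// Toy model of ring-style failure detection (hypothetical names, not JGroups code).
public class RingMonitorSketch {

    // Returns the member that 'self' would monitor: the next member after 'self'
    // in the view, wrapping around, and skipping every suspected member.
    public static Optional<String> pingTarget(List<String> view, String self, Set<String> suspected) {
        int start = view.indexOf(self);
        if (start < 0) {
            return Optional.empty(); // 'self' is not in this view
        }
        for (int step = 1; step < view.size(); step++) {
            String candidate = view.get((start + step) % view.size());
            if (!suspected.contains(candidate)) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty(); // everyone else is suspected
    }

    public static void main(String[] args) {
        List<String> view = List.of("C", "D", "B", "A");
        // Before the kill: D watches B, and A wraps around to watch C.
        System.out.println("D monitors " + pingTarget(view, "D", Set.of()));
        System.out.println("A monitors " + pingTarget(view, "A", Set.of()));
        // After C and D are killed and suspected, A's next candidate is B.
        System.out.println("A monitors " + pingTarget(view, "A", Set.of("C", "D")));
    }
}
```

Under this assumption D is B's monitor in the view [C, D, B, A], and A only turns to B once it has given up on both C and D, which is consistent with the race described above.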

      So far, I think, this is OK. The two sub-groups have different coordinators, so I expect that if everything stayed static here then in due course we'd get a merge and all would be well.

      • C and D restart. They both join B's sub-group.
      • So now A has [A], and B, C and D all have [B, A, C, D]

      Again, I think that this is still OK and should be resolved by a merge soon enough.

      • Now B and C are killed.
        • D sees that the new view would be [A, D] in which it would not be coordinator. So it doesn't install any new view.
        • A is unaffected: its view is still [A], which contains neither B nor C.
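
D's decision in the step above can be written down as a small check. The sketch assumes the rule implied by the description: a member drops the suspected members from its current view and installs the result only if it would come first, i.e. be the coordinator. The names are hypothetical and this is not the GMS implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Toy model of "install a new view only if I would be its coordinator"
// (hypothetical names, not JGroups code).
public class CoordinatorCheckSketch {

    // The view that remains once the suspected members are removed, order preserved.
    public static List<String> survivors(List<String> view, Set<String> suspected) {
        List<String> result = new ArrayList<>(view);
        result.removeAll(suspected);
        return result;
    }

    // True if 'self' would be first (the coordinator) in the surviving view.
    public static boolean wouldInstallView(List<String> view, Set<String> suspected, String self) {
        List<String> next = survivors(view, suspected);
        return !next.isEmpty() && next.get(0).equals(self);
    }

    public static void main(String[] args) {
        List<String> dView = List.of("B", "A", "C", "D");
        // B and C are killed: the surviving view puts A ahead of D.
        System.out.println(survivors(dView, Set.of("B", "C")));
        System.out.println(wouldInstallView(dView, Set.of("B", "C"), "D"));
    }
}
```

In this model D computes [A, D], sees A in first place, and installs nothing, so it keeps the stale view [B, A, C, D].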

      I'm not sure what would happen if we left things alone now, i.e. whether the group would recover or not. But in fact the stress test restarted B and C, so we go on...

      • B and C restart. Now they both join A's subgroup (C first, as it happens).
      • So A, B and C all end up with the view [A, C, B]
      • Meanwhile D still thinks that the view is [B, A, C, D]

      Now we seem to have a problem (and in my test, this is where things stopped happening):

      • A declines to lead a merge: it regularly logs "I (10.239.0.1) won't be the merge leader"
        • Presumably it is deciding that B would be a better merge leader
      • But B doesn't think that it's a coordinator, so it won't merge either.
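
The stuck state can be reproduced in a toy model. It assumes that the merge leader is chosen from the coordinators named in the differing views using some fixed ranking, and that the chosen member only starts a merge if it is the coordinator of its own view. The ranking that puts B ahead of A is purely an assumption matching the guess above, and none of the names below are JGroups APIs.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Toy model of merge-leader selection (hypothetical names, not JGroups code).
public class MergeLeaderSketch {

    // The coordinator of a view is its first member.
    public static String coordinatorOf(List<String> view) {
        return view.get(0);
    }

    // Collects the coordinator of every reported view and picks the one that
    // ranks first. Expects at least one non-empty view.
    public static String electLeader(Map<String, List<String>> views, Comparator<String> ranking) {
        TreeSet<String> candidates = new TreeSet<>(ranking);
        for (List<String> view : views.values()) {
            candidates.add(coordinatorOf(view));
        }
        return candidates.first();
    }

    // A merge starts only if the elected leader is coordinator of its own view.
    public static boolean mergeStarts(Map<String, List<String>> views, Comparator<String> ranking) {
        String leader = electLeader(views, ranking);
        List<String> ownView = views.get(leader);
        return ownView != null && coordinatorOf(ownView).equals(leader);
    }

    public static void main(String[] args) {
        // Each member's current view, as in the final state described above.
        Map<String, List<String>> views = new LinkedHashMap<>();
        views.put("A", List.of("A", "C", "B"));
        views.put("B", List.of("A", "C", "B"));
        views.put("C", List.of("A", "C", "B"));
        views.put("D", List.of("B", "A", "C", "D"));

        // Assumed ranking: B ahead of A.
        Comparator<String> bFirst = Comparator.reverseOrder();
        System.out.println("leader: " + electLeader(views, bFirst));
        System.out.println("merge starts: " + mergeStarts(views, bFirst));
    }
}
```

With that ranking the model elects B, because D's stale view still names B as coordinator, yet B's own view [A, C, B] says A is coordinator, so no merge starts. With the opposite ranking A would be elected and the merge would proceed, which suggests the outcome hinges on how candidates are ordered.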

      So we're stuck, with two different views!

      How is this situation expected to resolve itself?

      Thanks

      David

        1. 1451Repro.zip (4 kB)
        2. MyTest.java (3 kB)

              rhn-engineering-bban Bela Ban
              dimbleby David Hotham (Inactive)