JGroups / JGRP-1690

View and digest have to match


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • 3.4
    • None
    • None

      In some cases a view and a digest are returned together, e.g. v2={A,B,C} and digest=[25,10,17]. This means the highest delivered seqnos are A=25, B=10 and C=17.

      A view and a digest are returned together in:

      • Responses to a new joiner: JoinRsp
      • Merge view installations: GMS$GmsHeader.INSTALL_MERGE_VIEW
      • Merge responses: GMS$GmsHeader.MERGE_RSP (to be verified)

      However, in some edge cases we could end up with a digest that doesn't match the view, e.g. digest=[25,10]. There is no entry for C and, the way this currently works, the resulting digest would have a seqno of 0 for C.
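
      A minimal sketch of how a zero-initialized digest hides the missing entry. This is a simplified model using plain maps, not the actual JGroups MutableDigest class; the class name ZeroDefaultDigest and the method build() are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model for illustration only; not the JGroups MutableDigest class.
final class ZeroDefaultDigest {

    // Builds a digest for 'view': every member starts at seqno 0, then the
    // entries found in the digest obtained from NAKACK are copied over.
    static Map<String, Long> build(List<String> view, Map<String, Long> fromNakack) {
        Map<String, Long> digest = new LinkedHashMap<>();
        for (String member : view)
            digest.put(member, 0L);
        for (Map.Entry<String, Long> e : fromNakack.entrySet())
            if (digest.containsKey(e.getKey()))
                digest.put(e.getKey(), e.getValue());
        return digest;
    }

    public static void main(String[] args) {
        // NAKACK still holds the digest of the older view: no entry for C.
        Map<String, Long> stale = new LinkedHashMap<>();
        stale.put("A", 25L);
        stale.put("B", 10L);
        // C silently ends up with seqno 0; nothing signals the mismatch.
        System.out.println(build(Arrays.asList("A", "B", "C"), stale));
    }
}
```

      In this model a seqno of 0 is indistinguishable from "no entry", so the mismatch goes unnoticed.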

      The above scenario can happen as follows:

      • The view is v1={A,B}. A is the coordinator
      • C joins
      • A broadcasts v2={A,B,C}.
      • A installs v2, but doesn't yet set the digest (NAKACK.setDigest())
      • D joins
      • Meanwhile, C has sent 50 messages and STABLE has garbage-collected C's messages up to seqno 45
      • A creates a new view v3={A,B,C,D}
      • A gets the digest from NAKACK, which is still [A,B], and adds D (at 0)
      • A sends a JOIN-RSP with v3={A,B,C,D} and digest=[A,B,C,D] to D. Note that C is 0.
        • The reason for this is that we create a MutableDigest from v3 with all seqnos being 0. Then we iterate through the digest and set the seqnos. However, since C is not set, its seqno stays 0.
      • D installs the JOIN-RSP and thinks the seqno for C is 0. When C sends message #51, D asks C to retransmit [1-50], but C can't furnish them as it has already purged messages 1-45. This leads to endless retransmissions.
      • A (belatedly) sets the digest=[A,B,C] for v2 in NAKACK
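
      The retransmission problem in the last steps can be sketched with the numbers from the scenario. This is a simplified model, not NAKACK code; RetransmitGap, missingRange() and canServe() are hypothetical names, and it assumes the correct digest entry for C would have been 50.

```java
// Simplified model for illustration only; not the NAKACK retransmission code.
final class RetransmitGap {

    // Range [from, to] of seqnos a receiver would request when it has
    // delivered up to 'highestDelivered' and now receives seqno 'received'.
    // Returns an empty array if nothing is missing.
    static long[] missingRange(long highestDelivered, long received) {
        if (received <= highestDelivered + 1)
            return new long[0];
        return new long[] { highestDelivered + 1, received - 1 };
    }

    // A sender that purged everything up to and including 'purgedUpTo'
    // can only serve a range that starts above that seqno.
    static boolean canServe(long[] range, long purgedUpTo) {
        return range.length == 0 || range[0] > purgedUpTo;
    }

    public static void main(String[] args) {
        long[] wrong = missingRange(0, 51);   // D was told C=0
        long[] right = missingRange(50, 51);  // D was told C=50 (assumed correct value)
        // prints "1-50 servable: false"
        System.out.println(wrong[0] + "-" + wrong[1] + " servable: " + canServe(wrong, 45));
        // prints "gap with correct digest: 0"
        System.out.println("gap with correct digest: " + right.length);
    }
}
```

      With C=0 the requested range starts below the purge point, so the request can never be satisfied and is repeated indefinitely.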

      SOLUTION:

      • Make sure (in the coordinator) that view and digest match. E.g. a MutableDigest could initialize all seqnos to -1 and, after setting all values from the digest retrieved from NAKACK, throw an exception if any seqno is still -1.
        • We could retry fetching the digest from NAKACK for a number of tries before giving up
        • In the worst case, we wouldn't send a JOIN-RSP to the new joiner, but since the joiner would retry, this is not a problem.
      • Alternatively, we could have the client (ClientGmsImpl) check the digest and retry if some of the seqnos are -1.
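
      A rough sketch of the coordinator-side check with retries, under the same simplified map-based model. CheckedDigest, build(), buildWithRetry() and UNSET are hypothetical names, not JGroups API; the caller is assumed to have already added the new joiner at seqno 0 to the fetched digest.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Simplified model for illustration only; not the JGroups implementation.
final class CheckedDigest {
    static final long UNSET = -1L; // sentinel: no seqno has been set for this member

    // Returns a digest covering every member of 'view', or throws
    // IllegalStateException if the source digest lacks a member.
    static Map<String, Long> build(List<String> view, Map<String, Long> fromNakack) {
        Map<String, Long> digest = new LinkedHashMap<>();
        for (String member : view)
            digest.put(member, UNSET);
        for (Map.Entry<String, Long> e : fromNakack.entrySet())
            if (digest.containsKey(e.getKey()))
                digest.put(e.getKey(), e.getValue());
        for (Map.Entry<String, Long> e : digest.entrySet())
            if (e.getValue() == UNSET)
                throw new IllegalStateException("no seqno for " + e.getKey());
        return digest;
    }

    // Re-fetches the digest up to 'maxTries' times. Returns null if view and
    // digest never match; the caller would then skip the JOIN-RSP and rely
    // on the joiner retrying.
    static Map<String, Long> buildWithRetry(List<String> view,
                                            Supplier<Map<String, Long>> fetch,
                                            int maxTries) {
        for (int i = 0; i < maxTries; i++) {
            try {
                return build(view, fetch.get());
            } catch (IllegalStateException mismatch) {
                // digest not yet set for the latest view; fetch again
            }
        }
        return null;
    }
}
```

      A real implementation would probably wait briefly between tries. The same check could run on the client side instead, with the joiner rejecting a response that still contains -1.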

              rhn-engineering-bban Bela Ban