Type: Bug
Resolution: Done
Priority: Major
In some cases, a view and a digest are returned together, e.g. v2={A,B,C} and digest=[25,10,17]. This means that the highest delivered seqnos are A=25, B=10 and C=17.
View and digest are returned in:
- Responses to a new joiner: JoinRsp
- Merge view installations: GMS$GmsHeader.INSTALL_MERGE_VIEW
- Merge responses: GMS$GmsHeader.MERGE_RSP (to be verified)
However, in some edge cases we could end up with a digest that doesn't match the view, e.g. digest=[25,10]. This means that there is no entry for C and, given the way this currently works, the resulting digest would have a seqno of 0 for C!
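A minimal sketch of why the missing entry turns into 0, assuming a digest can be modeled as an ordered map from member name to highest delivered seqno. The class `CurrentDigestBehavior` and its method are hypothetical and are not the JGroups `Digest`/`MutableDigest` API; the sketch only mimics "start every view member at 0, then copy in whatever the NAKACK digest contains".

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the current behavior (not the real JGroups classes).
class CurrentDigestBehavior {

    // Every member of the view starts at seqno 0; entries present in the
    // NAKACK digest then overwrite those defaults. Members missing from the
    // NAKACK digest silently keep the default of 0.
    static Map<String, Long> digestForView(List<String> view, Map<String, Long> nakackDigest) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String member : view)
            result.put(member, 0L);
        for (Map.Entry<String, Long> e : nakackDigest.entrySet())
            if (result.containsKey(e.getKey()))
                result.put(e.getKey(), e.getValue());
        return result;
    }

    public static void main(String[] args) {
        Map<String, Long> nakack = new LinkedHashMap<>();
        nakack.put("A", 25L);
        nakack.put("B", 10L); // no entry for C
        System.out.println(digestForView(Arrays.asList("A", "B", "C"), nakack));
        // prints {A=25, B=10, C=0}
    }
}
```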
The above scenario can happen as follows:
- The view is v1={A,B}. A is the coordinator
- C joins
- A broadcasts v2={A,B,C}.
- A installs v2, but doesn't yet set the digest for v2 in NAKACK (NAKACK.setDigest())
- D joins
- Meanwhile, C sends 50 messages, and STABLE garbage-collects C's messages up to and including seqno 45
- A creates a new view v3={A,B,C,D}
- A gets the digest from NAKACK, which is still [A,B] because the digest for v2 hasn't been set yet, and adds D (at 0)
- A sends a JOIN-RSP with v3={A,B,C,D} and digest=[A,B,C,D] to D. Note that C's seqno is 0.
- The reason is that we create a MutableDigest for v3 with all seqnos set to 0, then iterate through the digest from NAKACK and copy its seqnos. Since that digest has no entry for C, C's seqno remains 0!
- D installs the view and digest from the JOIN-RSP and thinks that C's seqno is 0. When C sends message #51, D asks C to retransmit [1-50], but C can't furnish messages 1-45 because it has already purged them. This leads to endless retransmissions.
- A (belatedly) sets the digest=[A,B,C] for v2 in NAKACK
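The consequence for D can be sketched with plain seqno arithmetic. This is a hypothetical model, not JGroups' NAKACK retransmission code: `missingRange` computes the gap a receiver asks to have retransmitted, and `canRetransmit` assumes the sender still holds every message above the STABLE purge point. The "correct digest" case additionally assumes that A had delivered all 50 of C's messages before creating v3.

```java
// Hypothetical sketch of D's retransmission request (not the real NAKACK code).
class RetransmissionSketch {

    // Seqnos the receiver asks for when it gets message 'received' while its
    // digest says the last delivered seqno from that sender is 'lastDelivered'.
    // Returns {from, to}, or an empty array if there is no gap.
    static long[] missingRange(long lastDelivered, long received) {
        if (received <= lastDelivered + 1)
            return new long[0];
        return new long[] {lastDelivered + 1, received - 1};
    }

    // The sender can only serve a range that starts above the seqno up to
    // which STABLE has already purged its messages.
    static boolean canRetransmit(long from, long purgedUpTo) {
        return from > purgedUpTo;
    }

    public static void main(String[] args) {
        long[] gap = missingRange(0, 51); // D's digest says C is at 0
        System.out.println("D requests [" + gap[0] + "-" + gap[1] + "]");
        System.out.println("C can serve it: " + canRetransmit(gap[0], 45));

        long[] correctGap = missingRange(50, 51); // digest entry C=50 instead
        System.out.println("gap with correct digest: " + correctGap.length);
    }
}
```

Running `main` prints `D requests [1-50]`, `C can serve it: false` and `gap with correct digest: 0`: with a seqno of 0 for C, the request can never be satisfied, so D keeps asking.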
SOLUTION:
- Make sure (in the coordinator) that view and digest match. For example, a MutableDigest could initialize all seqnos to -1 and, after setting all values from the digest retrieved from NAKACK, throw an exception if any seqno is still -1.
- We could retry fetching the digest from NAKACK a number of times before giving up.
- In the worst case, we wouldn't send a JOIN-RSP to the new joiner, but since the joiner would retry, this is not a problem.
- Alternatively, we could have the client (ClientGmsImpl) check the digest and retry if some of the seqnos are -1.
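A minimal sketch of the proposed checks, under the same map-based digest model. `DigestValidation`, `digestForView`, `digestWithRetries` and `isComplete` are hypothetical names, not part of JGroups. New joiners are passed in explicitly and start at seqno 0, as A does for D in the scenario; every other view member starts at -1 and must be filled in from the NAKACK digest.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of the proposed fix (illustrative names, not JGroups API).
class DigestValidation {

    static final long UNSET = -1L; // "no seqno known for this member"

    // Coordinator side: fail if any view member is missing from the NAKACK digest.
    static Map<String, Long> digestForView(List<String> view, List<String> joiners,
                                           Map<String, Long> nakackDigest) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String member : view)
            result.put(member, UNSET);
        for (String joiner : joiners)
            if (result.containsKey(joiner))
                result.put(joiner, 0L); // new joiners haven't sent anything yet
        for (Map.Entry<String, Long> e : nakackDigest.entrySet())
            if (result.containsKey(e.getKey()))
                result.put(e.getKey(), e.getValue());
        for (Map.Entry<String, Long> e : result.entrySet())
            if (e.getValue() == UNSET)
                throw new IllegalStateException("digest has no entry for view member " + e.getKey());
        return result;
    }

    // Fetch the digest up to maxTries times. Returning null models the worst
    // case: no JOIN-RSP is sent, and the joiner retries its JOIN later.
    static Map<String, Long> digestWithRetries(List<String> view, List<String> joiners,
                                               Supplier<Map<String, Long>> fetchDigest,
                                               int maxTries) {
        for (int i = 0; i < maxTries; i++) {
            try {
                return digestForView(view, joiners, fetchDigest.get());
            } catch (IllegalStateException ignored) {
                // view and digest don't match yet (e.g. setDigest() not called); try again
            }
        }
        return null;
    }

    // Client-side alternative: the joiner rejects a digest with unset seqnos.
    static boolean isComplete(Map<String, Long> digest) {
        return !digest.containsValue(UNSET);
    }
}
```

With the digest [A,B] from the scenario, `digestForView` throws for C instead of silently using 0. A real implementation would probably also wait between tries, since the digest for v2 is only set "belatedly".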
Is related to: JGRP-1317 Compress Digest and MutableDigest (Resolved)