As per JGRP-1426 etc., I want views and messages to be sequenced. That is, if node A sees view V1 before message M1, then no node shall see M1 before V1; and vice versa. To that end, I have SEQUENCER below GMS in my stack.
I've just seen a case where:
- we have two groups: [A], and [D, B, C]
- we perform a merge, to get the new view [A, D, B, C]
- A sends INSTALL_MERGE_VIEW to both A and D
- Now there's a race. D wins, and broadcasts the new view to [D, B, C]
- Application at D sees the view change, and this causes it to broadcast a message in the new view [A, D, B, C]
- This arrives at A and overtakes the installation of the view there
- So the application at A sees D's message before it sees the new view, whereas at D the opposite was true.
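To make the property I'm after concrete, here is a minimal sketch of a check over per-node delivery logs. It is not JGroups code: the class name, the method names and the event labels (`V2` for the merged view, `M1` for D's message) are made up for illustration, and it assumes each event appears at most once in a node's log.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of JGroups: checks that all nodes saw
// views and messages in the same relative order.
final class DeliveryOrderCheck {

    // True if every event present in both logs appears in the same relative
    // order in each. Assumes an event occurs at most once per log.
    static boolean sameRelativeOrder(List<String> a, List<String> b) {
        List<String> commonInA = new ArrayList<>(a);
        commonInA.retainAll(b);   // events of a that b also saw, in a's order
        List<String> commonInB = new ArrayList<>(b);
        commonInB.retainAll(a);   // events of b that a also saw, in b's order
        return commonInA.equals(commonInB);
    }

    // True if every pair of nodes agrees on the order of their shared events.
    static boolean allConsistent(Map<String, List<String>> logs) {
        List<List<String>> all = new ArrayList<>(logs.values());
        for (int i = 0; i < all.size(); i++) {
            for (int j = i + 1; j < all.size(); j++) {
                if (!sameRelativeOrder(all.get(i), all.get(j))) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The case described above: A sees D's message before the merged view,
        // D sees the merged view first.
        Map<String, List<String>> logs = new LinkedHashMap<>();
        logs.put("A", Arrays.asList("M1", "V2"));
        logs.put("D", Arrays.asList("V2", "M1"));
        System.out.println("consistent: " + allConsistent(logs));
    }
}
```

With the two logs from the case above, the check reports the ordering as inconsistent, which is exactly the violation I'm describing.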
Here's some trace from A showing it installing the view on thread Incoming-1, and having this be overtaken by the message on thread Incoming-2:
(The last two lines are from my application. I also have trace from D showing it sending its message after installing the view, which I can provide if required).
- I think that this particular case could be fixed by removing the special case in GMS.java line 495 that goes "If we're the only member the VIEW is broadcast to, let's simply install the view directly"? My thinking is that if the view were sent as a message, then both it and the message from D, which has gone via SEQUENCER, would be messages from A, and therefore the overtaking seen above could not happen.
- However, I've been unable to convince myself that this will fix the more general case where a message from subgroup 1 arrives in subgroup 2 before subgroup 2 has installed the view.
- Indeed, I'm struggling to think of a way to make this work without changing the way that merge-views are installed.
- It feels to me as though the cleanest way to achieve what I'm looking for would be to have the new coordinator broadcast the new view to everyone, rather than having each of the old coordinators deal with its own subgroup. Then there are no races between subgroups.
- But I expect there are reasons for it to work the way that it does?
- And I expect that this would be quite a disruptive change to make.
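To illustrate why I think a single broadcaster would avoid the race, here is a toy model. It is not JGroups code and does not use any JGroups API: `Node`, `receive` and the sequence numbers are hypothetical. It assumes the new coordinator stamps the view and every message with one shared sequence number, and that each node delivers strictly in that order, buffering anything that arrives early.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model only: one sender numbers views and messages in a single stream.
final class SingleStreamModel {

    // Hypothetical receiver that delivers events in sequence-number order.
    static final class Node {
        private final Map<Long, String> pending = new HashMap<>();
        private long expected = 0;
        final List<String> delivered = new ArrayList<>();

        // Buffer the event, then deliver every event that is now in order.
        void receive(long seqno, String event) {
            pending.put(seqno, event);
            String next;
            while ((next = pending.remove(expected)) != null) {
                delivered.add(next);
                expected++;
            }
        }
    }

    public static void main(String[] args) {
        // Assumed numbering: the merged view is 0, D's message is 1.
        Node a = new Node();
        Node d = new Node();

        // At A the message arrives first, but is held back until the view arrives.
        a.receive(1, "M1");
        a.receive(0, "V2");

        // At D both arrive in order.
        d.receive(0, "V2");
        d.receive(1, "M1");

        System.out.println("A delivered " + a.delivered);
        System.out.println("D delivered " + d.delivered);
    }
}
```

In this model both nodes deliver the view before the message regardless of arrival order. Whether GMS can actually be made to send merge views this way is the part I'm unsure about.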
What do you think?
Thanks, as ever, for your help...