[Michael Newcomb]
Still debugging concurrent starting issues... Now I'm running into a
problem with FLUSH.
So, there are 3 current members (A, B, C) and a new one joins (D)...
1. coord starts a flush on A,B,C
2. coord receives FLUSH_COMPLETED from A,B (misses C)
3. coord times out and sleeps a few seconds
4. coord starts a new flush on A,B,C
Here is where the problems start. A,B (and possibly C) are already in a
FLUSH situation. As far as they are concerned a flush is in progress
because they sent FLUSH_COMPLETED to the coord.
So, when they get a new flush, they determine who they are going to
reject (either the currently flushing coordinator or the flush
requestor).
If the flush requestor is < the current flush coordinator, then a
reject flush is sent to the original flush coordinator and the flush
proceeds with the flush requestor.
If the flush requestor is > the current flush coordinator, then a
reject flush is sent to the flush requestor and the flush proceeds
with the original flush coordinator.
If the flush requestor is == the current flush coordinator, it behaves
the same as if the flush requestor was > the flush coordinator. A reject
flush is sent to the current coordinator (who is also the requestor) and
then a FLUSH_COMPLETED is sent to him...
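
To make the three cases concrete, here is a minimal sketch of the
decision as I read it. It is not the actual FLUSH code: addresses are
modeled as plain strings compared with compareTo, and the names
FlushRejectSketch, Decision and decide are made up for illustration.

```java
public class FlushRejectSketch {
    /** Who gets the reject, and whose flush the member continues with. */
    static final class Decision {
        final String rejected;
        final String proceedWith;

        Decision(String rejected, String proceedWith) {
            this.rejected = rejected;
            this.proceedWith = proceedWith;
        }
    }

    // Assumption: addresses are ordered; strings stand in for real addresses.
    static Decision decide(String currentCoordinator, String requestor) {
        if (requestor.compareTo(currentCoordinator) < 0) {
            // requestor < coordinator: reject the original coordinator
            return new Decision(currentCoordinator, requestor);
        }
        // requestor > coordinator, and also requestor == coordinator:
        // reject the requestor and proceed with the original coordinator
        return new Decision(requestor, currentCoordinator);
    }

    public static void main(String[] args) {
        Decision same = decide("A", "A");
        // In the == case the rejected node and the node we proceed with coincide.
        System.out.println("rejected=" + same.rejected
                + " proceedWith=" + same.proceedWith);
    }
}
```

In the == case the same node receives both the reject and the
FLUSH_COMPLETED.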
The problem is that the FLUSH_COMPLETED is basically ignored because the
reject flush sets the promise to FALSE which immediately fails the
flush. This causes another flush retry which results in the same thing
again and again until all the retries are exhausted and the overall
flush fails. Furthermore, the node that rejected the flush is left in
the exact same state: he thinks he is in a flush and will reject any new
flush requests by the current flush coordinator!
Essentially, retrying flushes is a waste of time...
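
A toy simulation of the retry loop, under the same simplifying
assumptions (string addresses, one member, hypothetical names such as
Member and failedRetries; none of this is the real protocol code):

```java
public class FlushRetrySketch {
    static final class Member {
        String flushCoordinator; // null until the member has seen a flush

        /** Returns true if the member answers with a reject to the requestor. */
        boolean rejectsStartFlush(String requestor) {
            if (flushCoordinator == null) {
                flushCoordinator = requestor; // first flush: accept it
                return false;
            }
            if (requestor.compareTo(flushCoordinator) < 0) {
                flushCoordinator = requestor; // lower requestor takes over
                return false;
            }
            // requestor >= current coordinator: reject, state stays unchanged
            return true;
        }
    }

    /** Counts how many retries by the same coordinator get rejected. */
    static int failedRetries(int maxRetries) {
        Member member = new Member();
        String coord = "coord";
        // Initial attempt: accepted, but the coordinator times out
        // (in the scenario above it misses C's FLUSH_COMPLETED).
        member.rejectsStartFlush(coord);
        int failed = 0;
        for (int i = 0; i < maxRetries; i++) {
            if (!member.rejectsStartFlush(coord)) {
                break; // a retry got through
            }
            failed++; // reject sets the promise to FALSE, attempt fails
        }
        return failed;
    }

    public static void main(String[] args) {
        System.out.println(failedRetries(3)); // prints 3
    }
}
```

Because the member never leaves its "flush in progress" state, every
retry from the same coordinator is rejected, whatever the retry count.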
I think that there are several ways to solve this problem.
Since the flush is 'restarted' (onStartFlush is called after the reject
is sent) even when the flush requestor == the current flush coordinator,
there may be no need to reject the flush when the flush requestor == the
current flush coordinator. Only send a reject flush if the
abortFlushCoordinator != proceedFlushCoordinator...
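
A rough sketch of that first option. Only abortFlushCoordinator,
proceedFlushCoordinator and onStartFlush come from the existing code;
the Member class, the counters and the string addresses are stand-ins
I made up, and I am assuming the restart ends with FLUSH_COMPLETED
being sent to the coordinator we proceed with.

```java
public class FlushGuardSketch {
    static final class Member {
        String flushCoordinator;  // null until a flush has started
        int rejectsSent;          // stand-in for sending a reject flush
        int flushCompletedSent;   // stand-in for onStartFlush -> FLUSH_COMPLETED

        void onFlushRequest(String requestor) {
            if (flushCoordinator != null) {
                String abortFlushCoordinator;
                String proceedFlushCoordinator;
                if (requestor.compareTo(flushCoordinator) < 0) {
                    abortFlushCoordinator = flushCoordinator;
                    proceedFlushCoordinator = requestor;
                } else {
                    abortFlushCoordinator = requestor;
                    proceedFlushCoordinator = flushCoordinator;
                }
                // Proposed guard: never reject the node we are about to
                // proceed with.
                if (!abortFlushCoordinator.equals(proceedFlushCoordinator)) {
                    rejectsSent++;
                }
                flushCoordinator = proceedFlushCoordinator;
            } else {
                flushCoordinator = requestor;
            }
            // The flush is (re)started, so FLUSH_COMPLETED goes out again.
            flushCompletedSent++;
        }
    }

    public static void main(String[] args) {
        Member member = new Member();
        member.onFlushRequest("coord"); // initial flush
        member.onFlushRequest("coord"); // retry by the same coordinator
        System.out.println("rejects=" + member.rejectsSent
                + " completed=" + member.flushCompletedSent);
    }
}
```

With the guard in place, a retry from the same coordinator just gets a
second FLUSH_COMPLETED and no reject, so the promise is not failed.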
If that is not sufficient, then when the flush requestor == the current
flush coordinator, the node that rejects the flush should not 'restart'
the flush by calling onStartFlush again (only call onStartFlush if
abortFlushCoordinator != proceedFlushCoordinator). Rejecting and then
restarting sets the next flush attempt up for failure again and again
because nothing has changed at the node: he still thinks a flush is
ongoing and will reject any new flushes from the current flush
coordinator.
Again, these cases are for when the flush requestor is == the current
flush coordinator. I have yet to attempt concurrent flush attempts by
different nodes.