Feature Request
Resolution: Done
Critical
2.6
None
I'm running into a few problems when multiple members request a FLUSH at the same time.
I am still analyzing the situation, but here is what I have found so far:
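private void handleStartFlush(Message msg, FlushHeader fh) {
    byte oldPhase = flushPhase.transitionToFirstPhase();
    if(oldPhase == FlushPhase.START_PHASE) {
        // ... (onStartFlush() is called on this path)
    }
    else if(oldPhase == FlushPhase.FIRST_PHASE) {
        Address flushRequester = msg.getSrc();
        Address coordinator = null;
        synchronized(sharedLock) {
            // ... coordinator = flushCoordinator, or flushRequester if flushCoordinator is not set yet
        }
        // ...
    }
}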
After a successful transition to the first phase, onStartFlush() is called, and it is inside that method that flushCoordinator is set. A simultaneous handleStartFlush() from another member will fail the transitionToFirstPhase() call, but flushCoordinator may not have been set yet; in that case the second handleStartFlush() ends up setting 'coordinator' to flushRequester. This can grant the second START_FLUSH a FLUSH_OK without rejecting the flush that succeeded in transitionToFirstPhase().
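The window described above can be replayed deterministically. The sketch below is a hypothetical, single-threaded model, not the JGroups FLUSH code: it only mimics the two separate critical sections (the phase transition, then onStartFlush() recording flushCoordinator). The class name, the byte constants and the use of plain Strings in place of Address are assumptions made for illustration.

```java
/**
 * Hypothetical model of the race (not JGroups code). Two separate critical
 * sections: the phase transition, and later onStartFlush() setting flushCoordinator.
 * Member addresses are plain Strings.
 */
public class FlushRaceDemo {
    static final byte START_PHASE = 0;
    static final byte FIRST_PHASE = 1;

    private final Object sharedLock = new Object();
    private byte phase = START_PHASE;
    private String flushCoordinator = null; // only set by onStartFlush()

    /** Critical section 1: move to FIRST_PHASE and return the old phase. */
    byte transitionToFirstPhase() {
        synchronized (sharedLock) {
            byte old = phase;
            phase = FIRST_PHASE;
            return old;
        }
    }

    /** Critical section 2 (winner only): the coordinator is recorded here, later. */
    void onStartFlush(String requester) {
        synchronized (sharedLock) {
            flushCoordinator = requester;
        }
    }

    /** What the losing START_FLUSH uses as 'coordinator' in the FIRST_PHASE branch. */
    String coordinatorSeenByLoser(String requester) {
        synchronized (sharedLock) {
            return flushCoordinator != null ? flushCoordinator : requester;
        }
    }

    /** Replays: B wins the transition, C arrives before B's onStartFlush() runs. */
    static String replayRace() {
        FlushRaceDemo member = new FlushRaceDemo();
        byte oldForB = member.transitionToFirstPhase();      // START_PHASE: B proceeds
        byte oldForC = member.transitionToFirstPhase();      // FIRST_PHASE: C loses the transition
        String seenByC = member.coordinatorSeenByLoser("C"); // flushCoordinator is still null
        member.onStartFlush("B");                            // too late for C's check
        if (oldForB != START_PHASE || oldForC != FIRST_PHASE) {
            throw new IllegalStateException("unexpected phases");
        }
        return seenByC;
    }

    public static void main(String[] args) {
        // C compares itself with itself (compareTo == 0), so B's flush is not rejected.
        System.out.println("coordinator seen by C: " + replayRace());
    }
}
```

In this interleaving C sees itself as the coordinator, which is exactly the "lost the transition but became coordinator anyway" case.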
IMHO, transitionToFirstPhase() should assign flushCoordinator at the same time, which avoids the situation above. In fact, all of the protected parts of onStartFlush() should be moved inside the protected section of transitionToFirstPhase().
The above should take care of the didn't-succeed-in-transitionToFirstPhase-but-became-flushCoordinator-anyways-because-flushCoordinator-wasn't-set-yet problem.
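A minimal sketch of that proposal, assuming a variant of transitionToFirstPhase() that takes the requester as a parameter (a hypothetical signature, not the JGroups API): the phase change and the coordinator assignment happen in one critical section, so a losing START_FLUSH can never observe FIRST_PHASE with flushCoordinator still unset. Addresses are again plain Strings.

```java
/**
 * Hypothetical sketch of the proposed fix (not JGroups code): the winner of the
 * phase transition is recorded as flushCoordinator under the same lock.
 */
public class AtomicFlushStartDemo {
    static final byte START_PHASE = 0;
    static final byte FIRST_PHASE = 1;

    private final Object sharedLock = new Object();
    private byte phase = START_PHASE;
    private String flushCoordinator = null;

    /** Returns the old phase; the winner becomes flushCoordinator in the same step. */
    byte transitionToFirstPhase(String requester) {
        synchronized (sharedLock) {
            byte old = phase;
            if (old == START_PHASE) {
                phase = FIRST_PHASE;
                flushCoordinator = requester;
            }
            return old;
        }
    }

    /** Never null once FIRST_PHASE has been reached. */
    String coordinatorSeenByLoser() {
        synchronized (sharedLock) {
            return flushCoordinator;
        }
    }

    /** B wins the transition, then C arrives before B has done any further work. */
    static String replay() {
        AtomicFlushStartDemo member = new AtomicFlushStartDemo();
        member.transitionToFirstPhase("B");                // B wins and is recorded atomically
        byte oldForC = member.transitionToFirstPhase("C"); // C loses the transition
        if (oldForC != FIRST_PHASE) throw new IllegalStateException("C should lose");
        return member.coordinatorSeenByLoser();
    }

    public static void main(String[] args) {
        System.out.println("coordinator seen by C: " + replay());
    }
}
```

Here C's comparison runs against B rather than against itself, whatever the timing of the rest of onStartFlush().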
Now, there is another problem in the way simultaneous START_FLUSHes are resolved. If the new flushRequester is less than (<) the current flushCoordinator, the new requester wins the flush fight. This is a good strategy because every member reaches the same result.
If A, B, C are members and A < B < C and they all simultaneously request a flush, then the desired effect is for everyone to decide that A will win the flush fight.
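To see why every member agrees, the rule can be applied to every possible arrival order of the three requests. This is a sketch under stated assumptions: Strings stand in for Address, String.compareTo stands in for the address ordering, and each member is assumed to see all three requests, in any order. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical check that "smaller requester wins" is independent of arrival order. */
public class FlushFightDemo {
    /** Applies the rule to START_FLUSH requests in the order one member receives them. */
    static String winner(List<String> arrivalOrder) {
        String coordinator = null;
        for (String requester : arrivalOrder) {
            if (coordinator == null || requester.compareTo(coordinator) < 0) {
                // in FLUSH, the previous coordinator's flush would be rejected here
                coordinator = requester;
            }
        }
        return coordinator;
    }

    /** All permutations of the input (fine for three members). */
    static List<List<String>> permutations(List<String> items) {
        List<List<String>> result = new ArrayList<>();
        if (items.isEmpty()) {
            result.add(new ArrayList<>());
            return result;
        }
        for (int i = 0; i < items.size(); i++) {
            List<String> rest = new ArrayList<>(items);
            String head = rest.remove(i);
            for (List<String> tail : permutations(rest)) {
                tail.add(0, head);
                result.add(tail);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (List<String> order : permutations(List.of("A", "B", "C"))) {
            System.out.println(order + " -> " + winner(order));
        }
    }
}
```

Because the result is simply the minimum of the requesters a member has seen, all six orders end with A.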
The following code handles that situation:
private void handleStartFlush(Message msg, FlushHeader fh) {
    if(oldPhase == FlushPhase.START_PHASE) {
        // ...
    } else if(oldPhase == FlushPhase.FIRST_PHASE) {
        // ...
        if(flushRequester.compareTo(coordinator) < 0) {
            rejectFlush(fh.viewID, coordinator);
            // ...
            synchronized(sharedLock) {
                // ...
            }
        }
    }
}
If the new flushRequester is less than the old flushCoordinator, the old flush is rejected (aborted). However, the new flushCoordinator is never acknowledged by an onStartFlush() call, which is what sends FLUSH_OK. And if we moved all the protected parts of onStartFlush() into transitionToFirstPhase(), the flush members would never be set properly in this case, so that situation needs to be resolved as well.
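One possible shape for the missing step, as a hypothetical sketch rather than the actual patch: when a smaller requester preempts the current coordinator, the old flush is rejected and the new coordinator is also acknowledged. A log entry "FLUSH_OK x" stands in for the onStartFlush() work (setting flush members and sending FLUSH_OK), network sends are replaced by that log, and addresses are Strings. Rejecting a larger requester is an assumption of this sketch; the report does not say how that case is handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a preempting requester is rejected-over AND acknowledged. */
public class PreemptingFlushStartDemo {
    private final Object sharedLock = new Object();
    private String flushCoordinator = null;
    final List<String> sent = new ArrayList<>(); // what this member would send, in order

    void handleStartFlush(String requester) {
        String toReject = null;
        boolean acknowledge = false;
        synchronized (sharedLock) {
            if (flushCoordinator == null) {
                flushCoordinator = requester;        // first request wins outright
                acknowledge = true;
            } else if (requester.compareTo(flushCoordinator) < 0) {
                toReject = flushCoordinator;         // smaller requester preempts the old flush
                flushCoordinator = requester;
                acknowledge = true;                  // the acknowledgement the report says is missing
            } else if (requester.compareTo(flushCoordinator) > 0) {
                toReject = requester;                // assumption: a larger requester is rejected
            }                                        // equal: duplicate request, ignored here
        }
        // "Send" outside the lock.
        if (toReject != null) sent.add("reject " + toReject);
        if (acknowledge) sent.add("FLUSH_OK " + requester); // stands in for onStartFlush()
    }

    String coordinator() {
        synchronized (sharedLock) {
            return flushCoordinator;
        }
    }

    static PreemptingFlushStartDemo replay(List<String> arrivals) {
        PreemptingFlushStartDemo member = new PreemptingFlushStartDemo();
        for (String requester : arrivals) {
            member.handleStartFlush(requester);
        }
        return member;
    }

    public static void main(String[] args) {
        PreemptingFlushStartDemo member = replay(List.of("B", "C", "A"));
        System.out.println(member.sent + ", coordinator = " + member.coordinator());
    }
}
```

With arrivals B, C, A the log is [FLUSH_OK B, reject C, reject B, FLUSH_OK A], so A, the eventual winner, is acknowledged instead of being left waiting.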
I suspect this is the problem I'm seeing, because my flush waits indefinitely and never succeeds.
I will be testing a patch this afternoon and hope to report back.