There is a race condition in FLUSH when the master node is leaving the cluster.
It can cause the master to leave without sending a new view (with a new master).
The FLUSH is started when GMS sends down an Event.SUSPEND.
FLUSH.down calls FLUSH.startFlush, which calls FLUSH.onSuspend.
onSuspend sends a START_FLUSH message down.
In the working case, the local node gets the START_FLUSH first.
FLUSH.up calls FLUSH.handleStartFlush, which calls FLUSH.onStartFlush.
onStartFlush sets the member variable "flushMembers".
Then the other nodes reply to the START_FLUSH with a FLUSH_COMPLETED.
FLUSH.up calls FLUSH.onFlushCompleted.
onFlushCompleted checks "flushMembers" against the list of replies.
If they match (and flushMembers is not null), the flush completes.
But in the non-working case, the FLUSH_COMPLETED from the other
nodes is processed before the local START_FLUSH.
In this case, flushMembers has not been set, and onFlushCompleted
does nothing, expecting more replies (which never come).
I believe this can only be triggered when the master is leaving,
because the master does not include itself in the FLUSH. If it were a flush
member, there would be a FLUSH_COMPLETED reply from itself to
trigger setting flushMembers at some point.
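
The two orderings described above can be reproduced with a small stand-alone model. This is only a sketch, not the JGroups source: the class FlushRaceSketch, the run helper, the "completed" set and the node names "A", "B", "C" are all hypothetical, and the "replies match flushMembers" check is assumed to be a simple contains-all test.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the master's FLUSH state (not JGroups code).
class FlushRaceSketch {
    private Set<String> flushMembers;                       // null until the local START_FLUSH is handled
    private final Set<String> completed = new HashSet<>();  // senders of FLUSH_COMPLETED seen so far
    private boolean flushDone;

    // Models onStartFlush: records who takes part in the flush.
    void onStartFlush(List<String> members) {
        flushMembers = new HashSet<>(members);
    }

    // Models onFlushCompleted: the flush completes only if flushMembers is
    // already set and every flush member has replied (assumed check).
    void onFlushCompleted(String sender) {
        completed.add(sender);
        if (flushMembers != null && completed.containsAll(flushMembers)) {
            flushDone = true;
        }
    }

    boolean isFlushDone() {
        return flushDone;
    }

    // Replays one ordering on a leaving master "A" that excludes itself from the flush.
    static boolean run(boolean startFlushFirst) {
        FlushRaceSketch master = new FlushRaceSketch();
        List<String> participants = Arrays.asList("B", "C"); // "A" is not a flush member
        if (startFlushFirst) {
            master.onStartFlush(participants);   // working case
            master.onFlushCompleted("B");
            master.onFlushCompleted("C");
        } else {
            master.onFlushCompleted("B");        // non-working case: replies arrive first
            master.onFlushCompleted("C");
            master.onStartFlush(participants);   // nothing re-checks the replies afterwards
        }
        return master.isFlushDone();
    }

    public static void main(String[] args) {
        System.out.println("START_FLUSH first:     " + run(true));
        System.out.println("FLUSH_COMPLETED first: " + run(false));
    }
}
```

In this model the second ordering leaves the flush incomplete forever, because no further FLUSH_COMPLETED arrives to re-run the check once flushMembers has been set.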
- blocks
  JBPAPP-5900 Race condition in FLUSH when master leaves cluster (Closed)
  JBPAPP-2509 FLUSH/JOIN failure with a node joining the group (Resolved)