Feature Request
Resolution: Done
Major
2.4
None
High
With FLUSH, every member starts blocking outgoing calls in FLUSH.down() as soon as START_FLUSH has been received and block() has returned. A FLUSH_OK is then multicast to the group. The issue with this is that a member might want to complete some work in block(), such as sending a PREPARE or COMMIT multicast across the cluster and waiting for all replies (or a timeout).
As described in point #A in Brian's email (below), this won't work as the unicast response to the PREPARE or COMMIT call might block in FLUSH.down() if that member already received the START_FLUSH.
As outlined in point #B below, if we moved the blocking from the point where block() returns (after START_FLUSH) to the point where FLUSH_OK responses have been received from all members, then PREPARE/COMMIT would be able to complete: nobody blocks the unicast responses back to (e.g.) P until P's FLUSH_OK has been received, and that is only the case once P's block() has returned.
Contrary to Brian's solution, I think the FLUSH_COMPLETED message can be unicast to just the initiator of the flush, rather than multicast to the entire group.
The point of this JIRA issue is to investigate whether moving blocking renders the flush protocol incorrect or not.
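To make the proposed change concrete, here is a minimal sketch (hypothetical class and method names, not actual JGroups code) of a member that only starts blocking down() once FLUSH_OK has arrived from every member, and then unicasts FLUSH_COMPLETED to the flush initiator only:

    import java.util.HashSet;
    import java.util.Set;

    // Simplified model of the proposed flush phase, NOT the real FLUSH protocol:
    // down() traffic is blocked only once FLUSH_OK has been received from every
    // member, and FLUSH_COMPLETED is then unicast to the flush initiator rather
    // than multicast to the whole group.
    public class ProposedFlushModel {
        private final Set<String> members;                      // current view
        private final Set<String> flushOks = new HashSet<>();   // senders of FLUSH_OK
        private volatile boolean downBlocked = false;

        public ProposedFlushModel(Set<String> members) {
            this.members = members;
        }

        // Called for every FLUSH_OK, including our own loopback copy.
        public synchronized void onFlushOk(String sender, String flushInitiator) {
            flushOks.add(sender);
            if (flushOks.containsAll(members)) {
                downBlocked = true;                              // only now start blocking down()
                sendUnicast(flushInitiator, "FLUSH_COMPLETED");  // initiator only, not the group
            }
        }

        public boolean isDownBlocked() {
            return downBlocked;
        }

        private void sendUnicast(String dest, String msg) {
            System.out.println("unicast " + msg + " -> " + dest);
        }
    }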
[excerpt from Brian's email]
Vladimir and I found a problem today with using FLUSH in a JBC cache.
Following is a description of the issue and some proposed solutions.
Comments are welcome.
Please see docs/design/FLUSH.txt in JGroups for background info on how
FLUSH works.
A) We have a problem in that the FLUSH protocol makes the decision to
shut off the ability to pass messages down the channel independently at
each node. The protocol doesn't include anything at the JGroups level
to readily support coordination between nodes as to when to shut off
down messages. But, JBC needs coordination since it needs to make RPC
calls around the cluster (e.g. commit()) as part of how it handles
FLUSH.
Basically, when the FLUSH protocol on a node receives a message telling
it to START_FLUSH, it calls block() on the JBC instance. JBC does what
it needs to do, then returns from block(). Following the return from
block() the FLUSH protocol in that channel then begins blocking any
further down() messages.
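A minimal sketch of the behaviour just described, with hypothetical names rather than the real FLUSH protocol class: on START_FLUSH the application's block() callback is invoked (analogous to MembershipListener.block()), and only after it returns does down() start parking callers until the flush ends.

    import java.util.concurrent.CountDownLatch;

    // Simplified model of the current behaviour, NOT the real FLUSH class: once
    // START_FLUSH arrives and the application's block() callback has returned,
    // every further down() call waits until the flush is over.
    public class CurrentFlushModel {
        // Open gate (count 0) lets messages pass; a closed gate blocks down().
        private volatile CountDownLatch flushGate = new CountDownLatch(0);

        // Hypothetical application callback, analogous to MembershipListener.block().
        public interface BlockHandler {
            void block();
        }

        public void onStartFlush(BlockHandler app) {
            app.block();                        // let the application quiesce first
            flushGate = new CountDownLatch(1);  // from now on, down() blocks
        }

        public void down(Object msg) throws InterruptedException {
            flushGate.await();                  // blocks while a flush is in progress
            // ... hand msg to the protocol below (omitted in this sketch)
        }

        public void onStopFlush() {
            flushGate.countDown();              // re-open the channel
        }
    }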
Problem is as follows. 2 node REPL_SYNC cluster, A B where A is just
starting up and thus initiates a FLUSH:
1) JBC on B has tx in progress, just starting the 2PC. Sends out the
prepare().
2) A sends out a START_FLUSH message.
3) A gets START_FLUSH, calls block() on JBC.
4) JBC on A is new, doesn't have much going on, very quickly returns
from block(). A will no longer pass down any messages below FLUSH.
5) A gets the prepare() (no problem, FLUSH doesn't block up messages,
just down messages.)
6) A executes the prepare(), but can't send the response to B because
FLUSH is blocking the channel.
7) B gets the START_FLUSH, calls block() on JBC.
8) JBC B doesn't immediately return from block() as it is giving the
prepare() some time to complete (avoid unnecessary tx rollback). But
prepare() won't complete because A's channel is blocking the RPC
response!! Eventually JBC B's block() impl will have to roll back the
tx.
Basically you have a race condition between calls to block() and
prepare() calls, and can have different winners on different nodes.
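For illustration, a sketch of what B's block() implementation might look like in step 8 (hypothetical names, not JBoss Cache code): it gives the in-flight prepare() a bounded amount of time and rolls the transaction back if A's response never arrives because A's channel is already blocking down messages.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.TimeUnit;

    // Sketch of step 8 from B's point of view, with hypothetical names rather
    // than real JBoss Cache code.
    public class PendingTxBlockHandler {
        private final CountDownLatch prepareResponses;  // one count per expected response

        public PendingTxBlockHandler(int expectedResponses) {
            this.prepareResponses = new CountDownLatch(expectedResponses);
        }

        // Invoked by the channel when START_FLUSH arrives.
        public void block() throws InterruptedException {
            boolean completed = prepareResponses.await(5, TimeUnit.SECONDS);
            if (!completed) {
                rollbackTx();  // A never answered because its channel is blocked: roll back
            }
        }

        // Invoked when a prepare() response arrives from a member.
        public void onPrepareResponse() {
            prepareResponses.countDown();
        }

        private void rollbackTx() {
            System.out.println("rolling back the transaction");
        }
    }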
B) A solution we discussed, rejected and then came back to this evening
(please read FLUSH.txt to understand the change we're discussing):
Channel does not block down messages when block() returns. Rather it
just sends out a FLUSH_OK message (see FLUSH.txt). It shouldn't
initiate any new cluster activity (e.g. a prepare()) after sending
FLUSH_OK, but it can respond to RPC calls. When it gets a FLUSH_OK from
all the other members, it then blocks down messages and multicasts a
FLUSH_COMPLETED to the cluster.
Differences from the current FLUSH impl:
1) Node doesn't begin blocking down messages before sending FLUSH_OK.
2) Node begins blocking down messages before sending FLUSH_COMPLETED.
3) Node multicasts FLUSH_COMPLETED, rather than unicasting to the node
that initiated the FLUSH.
4) Nodes regard the FLUSH_COMPLETED as the last message from another
node, rather than the FLUSH_OK.
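A small sketch of difference 4 (hypothetical names, not JGroups code): under this proposal a member treats FLUSH_COMPLETED, not FLUSH_OK, as the last message it will accept from a peer, so the flush round is regarded as stable only once FLUSH_COMPLETED has been received from every member.

    import java.util.HashSet;
    import java.util.Set;

    // Tracks which members have sent FLUSH_COMPLETED; no further messages are
    // expected from a member after that point.
    public class FlushCompletionTracker {
        private final Set<String> members;
        private final Set<String> completed = new HashSet<>();

        public FlushCompletionTracker(Set<String> members) {
            this.members = members;
        }

        // Record a FLUSH_COMPLETED from 'sender'; returns true once the whole group is quiesced.
        public synchronized boolean onFlushCompleted(String sender) {
            completed.add(sender);
            return completed.containsAll(members);
        }
    }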
A downside of this idea is it changes the semantics of flush and
requires JGroups changes. We'd definitely like input from Bela on this.
Also, since we initially rejected it, we haven't fully thought it
through. (As I'm editing this to send out I see there is no way to tell
JBC after it returns from block() to not let any "new" activity through
– big hole. I'm back to rejecting this approach.)
C) Alternative idea we discussed was to do application level
coordination around the cluster, i.e. add something similar to the
existing FLUSH_OK/FLUSH_COMPLETED, but at the JBC level. Revising the
previous scenario:
1) JBC on B has tx in progress, just starting the 2PC. Sends out the
prepare().
2) A sends out a START_FLUSH message.
3) A gets START_FLUSH, calls block().
4) JBC on A is new, doesn't have much going on, so doesn't do cleanup
work on its own node.
4.1) JBC on A sends out an RPC call with its address as an arg to a new
"flushReady()" method added to TreeCache. (Other name for method is
fine.)
4.2) JBC on A blocks waiting for flushReady() RPC calls from all the
other members. Does not return from block().
5) A gets the prepare() (no problem, FLUSH doesn't block up messages,
just down messages.)
6) A executes the prepare(), can send the response to B because FLUSH
isn't blocking the channel.
7) B gets the START_FLUSH, calls block().
8) JBC B doesn't immediately return from block() as it detects it has a
2PC in progress and is giving the prepare() some time to complete (avoid
unnecessary tx rollback).
9) JBC B receives flushReady() call from A, adds entry to a vector
recording A is ready.
10) B receives prepare() response from A, sends commit().
11) A receives commit(), commits tx.
12) B sends out RPC call with its address to the "flushReady()" method.
13) A receives flushReady() call from B. Adds entry to a vector
recording that B is ready.
14) A sees that all other nodes are ready, returns from block().
15) B sees that all other nodes are ready, returns from block().
Downside to this is complexity and requirement to add another method for
the "flushReady()" RPC.
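A sketch of the coordination described in #C, with hypothetical names rather than actual TreeCache code: block() announces this node's readiness via a flushReady() RPC and only returns once every other member has announced its own readiness, so no node starts blocking down messages while a peer still has cluster calls in flight.

    import java.util.HashSet;
    import java.util.Set;

    // Application-level flush coordination, per alternative #C.
    public class FlushReadyCoordinator {
        private final String localAddress;
        private final Set<String> otherMembers;
        private final Set<String> ready = new HashSet<>();

        public FlushReadyCoordinator(String localAddress, Set<String> otherMembers) {
            this.localAddress = localAddress;
            this.otherMembers = otherMembers;
        }

        // Invoked by the channel on START_FLUSH; returns only when all peers are ready.
        public synchronized void block() throws InterruptedException {
            // Finish or roll back any in-flight 2PC first (omitted), then announce readiness.
            sendFlushReadyRpc(localAddress);
            while (!ready.containsAll(otherMembers)) {
                wait();                          // woken up by flushReady() below
            }
        }

        // RPC target: a peer tells us it is ready for the flush.
        public synchronized void flushReady(String sender) {
            ready.add(sender);
            notifyAll();
        }

        private void sendFlushReadyRpc(String address) {
            System.out.println("RPC flushReady(" + address + ") to the group");
        }
    }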
D) A 3rd alternative is to just accept the problem. The problem is a
race condition where A blocks down events but then receives a prepare().
Its response to prepare() cannot be sent. The effect is JBC B's impl of
FLUSH will detect the prepare() isn't progressing and at some point roll
back the tx. This will result in a rollback() message being sent to A.
A can receive it and roll back the tx. IIRC a rollback() is always
async, so A does not need to send a response. A and B end up in a valid
state.
Downside of this is the tx gets rolled back. This could be a frequent
occurrence in high load scenarios because a new node in the cluster
could be expected to very quickly call blockOK(), possibly even before
the START_FLUSH message goes out on the wire.
Brian Stansberry
Lead, AS Clustering
JBoss, a division of Red Hat