JGroups / JGRP-336

Move the blocking in FLUSH from START_FLUSH to FLUSH_OK

    • Type: Feature Request
    • Resolution: Done
    • Priority: Major
    • Affects Version: 2.4
    • Fix Version: 2.4
    • Component: None
    • High

      With FLUSH, every member starts blocking outgoing calls in FLUSH.down() as soon as START_FLUSH has been received and block() has returned. A FLUSH_OK is then multicast to the group. The issue with this is that a member might want to complete some work in block(), e.g. sending a PREPARE or COMMIT multicast across the cluster and waiting for all replies (or a timeout).
      As described in point #A of Brian's email (below), this won't work: the unicast response to the PREPARE or COMMIT call may block in FLUSH.down() on a member that has already received the START_FLUSH.
      As outlined in point #B below, if we move the blocking from START_FLUSH (after block() returns) to the point where FLUSH_OK responses have been received from all members, then PREPARE/COMMIT can complete: nobody blocks the unicast responses back to (e.g.) P until P's FLUSH_OK has been received, and that only happens once P's block() has returned.
      Unlike in Brian's solution, I think the FLUSH_COMPLETED message can be unicast to just the initiator of the flush, rather than multicast to the entire group.
      The point of this JIRA issue is to investigate whether moving the blocking point renders the flush protocol incorrect.
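      To make the proposed change concrete, here is a minimal sketch (the class and method names are illustrative, not the actual JGroups FLUSH implementation): blocking of down() messages is deferred until a FLUSH_OK has arrived from every member, so unicast responses to in-flight RPCs still get through.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the proposal: defer blocking until all FLUSH_OKs
// have been collected, instead of blocking right after block() returns.
public class DeferredFlushBlock {
    private final Set<String> members;
    private final Set<String> flushOkReceived = new HashSet<>();
    private volatile boolean downBlocked = false;

    public DeferredFlushBlock(Set<String> members) {
        this.members = members;
    }

    // Called when a member's block() callback has returned and its
    // FLUSH_OK multicast arrives here.
    public synchronized void onFlushOk(String sender) {
        flushOkReceived.add(sender);
        // Only once *all* members have acknowledged do we start blocking
        // down messages -- this is the moved blocking point.
        if (flushOkReceived.containsAll(members))
            downBlocked = true;
    }

    public boolean isDownBlocked() {
        return downBlocked;
    }

    public static void main(String[] args) {
        DeferredFlushBlock flush = new DeferredFlushBlock(Set.of("A", "B"));
        flush.onFlushOk("A");
        System.out.println(flush.isDownBlocked()); // false: B's FLUSH_OK pending
        flush.onFlushOk("B");
        System.out.println(flush.isDownBlocked()); // true: all FLUSH_OKs received
    }
}
```

While B's FLUSH_OK is outstanding, down() traffic (such as the unicast reply to a PREPARE) is still allowed, which is exactly what the current implementation forbids.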

      [ excerpt from Brian's email]

      Vladimir and I found a problem today with using FLUSH in a JBC cache.
      Following is a description of the issue and some proposed solutions.
      Comments are welcome.

      Please see docs/design/FLUSH.txt in JGroups for background info on how
      FLUSH works.

      A) We have a problem in that the FLUSH protocol makes the decision to
      shut off the ability to pass messages down the channel independently at
      each node. The protocol doesn't include anything at the JGroups level
      to readily support coordination between nodes as to when to shut off
      down messages. But, JBC needs coordination since it needs to make RPC
      calls around the cluster (e.g. commit()) as part of how it handles
      FLUSH.

      Basically, when the FLUSH protocol on a node receives a message telling
      it to START_FLUSH, it calls block() on the JBC instance. JBC does what
      it needs to do, then returns from block(). Following the return from
      block() the FLUSH protocol in that channel then begins blocking any
      further down() messages.
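      The current behaviour described above can be sketched as follows (class and method names are illustrative, not the real JGroups FLUSH code): on START_FLUSH the protocol invokes the application's block() callback, and the moment it returns, all further down() messages are held.

```java
// Hypothetical sketch of the *current* FLUSH behaviour: blocking of down()
// messages begins as soon as the application's block() callback returns.
public class CurrentFlushBehaviour {
    private volatile boolean blockingDown = false;

    // Application callback, analogous to MembershipListener.block().
    interface BlockCallback { void block(); }

    public void onStartFlush(BlockCallback app) {
        app.block();         // application finishes whatever it can
        blockingDown = true; // from here on, down() messages are blocked
    }

    // Returns false instead of actually parking the caller, to keep the
    // sketch single-threaded; the real protocol would make the sender wait.
    public boolean down(String msg) {
        return !blockingDown;
    }

    public static void main(String[] args) {
        CurrentFlushBehaviour flush = new CurrentFlushBehaviour();
        System.out.println(flush.down("prepare-response")); // true: not yet blocked
        flush.onStartFlush(() -> { /* app cleanup */ });
        System.out.println(flush.down("prepare-response")); // false: blocked
    }
}
```

The race below arises because each node flips its blockingDown flag independently, with no cluster-wide coordination.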

      Problem is as follows. 2 node REPL_SYNC cluster, A B where A is just
      starting up and thus initiates a FLUSH:

      1) JBC on B has tx in progress, just starting the 2PC. Sends out the
      prepare().
      2) A sends out a START_FLUSH message.
      3) A gets START_FLUSH, calls block() on JBC.
      4) JBC on A is new, doesn't have much going on, very quickly returns
      from block(). A will no longer pass down any messages below FLUSH.
      5) A gets the prepare() (no problem, FLUSH doesn't block up messages,
      just down messages.)
      6) A executes the prepare(), but can't send the response to B because
      FLUSH is blocking the channel.
      7) B gets the START_FLUSH, calls block() on JBC.
      8) JBC B doesn't immediately return from block() as it is giving the
      prepare() some time to complete (avoid unnecessary tx rollback). But
      prepare() won't complete because A's channel is blocking the RPC
      response!! Eventually JBC B's block() impl will have to roll back the
      tx.

      Basically you have a race condition between calls to block() and
      prepare() calls, and can have different winners on different nodes.

      B) A solution we discussed, rejected and then came back to this evening
      (please read FLUSH.txt to understand the change we're discussing):

      Channel does not block down messages when block() returns. Rather it
      just sends out a FLUSH_OK message (see FLUSH.txt). It shouldn't
      initiate any new cluster activity (e.g. a prepare()) after sending
      FLUSH_OK, but it can respond to RPC calls. When it gets a FLUSH_OK from
      all the other members, it then blocks down messages and multicasts a
      FLUSH_COMPLETED to the cluster.

      Differences from the current FLUSH impl:

      1) Node doesn't begin blocking down messages before sending FLUSH_OK.
      2) Node begins blocking down messages before sending FLUSH_COMPLETED.
      3) Node multicasts FLUSH_COMPLETED, rather than unicasting to the node
      that initiated the FLUSH.
      4) Nodes regard the FLUSH_COMPLETED as the last message from another
      node, rather than the FLUSH_OK.
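      The four differences above can be sketched as a small state machine (the names are assumptions, not the real JGroups classes): a node sends FLUSH_OK when block() returns, keeps responding to RPCs, and only blocks down messages and multicasts FLUSH_COMPLETED once FLUSH_OK has arrived from every other member.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative state machine for proposal B (hypothetical names).
public class ProposalBNode {
    enum State { RUNNING, FLUSH_OK_SENT, DOWN_BLOCKED }

    private final Set<String> otherMembers;
    private final Set<String> oksReceived = new HashSet<>();
    private State state = State.RUNNING;
    final List<String> sentMessages = new ArrayList<>(); // records multicasts

    ProposalBNode(Set<String> otherMembers) {
        this.otherMembers = otherMembers;
    }

    void onStartFlush() {
        // block() would be invoked here; when it returns:
        sentMessages.add("FLUSH_OK");   // difference #1: not yet blocking down
        state = State.FLUSH_OK_SENT;    // may still answer RPCs, no new activity
    }

    void onFlushOk(String from) {
        oksReceived.add(from);
        if (oksReceived.containsAll(otherMembers)) {
            state = State.DOWN_BLOCKED;          // difference #2: block now
            sentMessages.add("FLUSH_COMPLETED"); // differences #3/#4: multicast,
                                                 // treated as the last message
        }
    }

    boolean canRespondToRpc() {
        return state != State.DOWN_BLOCKED;
    }

    public static void main(String[] args) {
        ProposalBNode a = new ProposalBNode(Set.of("B"));
        a.onStartFlush();
        System.out.println(a.canRespondToRpc()); // true: can still answer prepare()
        a.onFlushOk("B");
        System.out.println(a.sentMessages);      // [FLUSH_OK, FLUSH_COMPLETED]
    }
}
```

In the FLUSH_OK_SENT state, A can still send the prepare() response to B, which removes the deadlock from scenario A above.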

      A downside of this idea is it changes the semantics of flush and
      requires JGroups changes. We'd definitely like input from Bela on this.
      Also, since we initially rejected it, we haven't fully thought it
      through. (As I'm editing this to send out, I see there is no way to tell
      JBC after it returns from block() not to let any "new" activity through;
      that's a big hole. I'm back to rejecting this approach.)

      C) Alternative idea we discussed was to do application level
      coordination around the cluster, i.e. add something similar to the
      existing FLUSH_OK/FLUSH_COMPLETED, but at the JBC level. Revising the
      previous scenario:

      1) JBC on B has tx in progress, just starting the 2PC. Sends out the
      prepare().
      2) A sends out a START_FLUSH message.
      3) A gets START_FLUSH, calls block().
      4) JBC on A is new, doesn't have much going on, so doesn't do cleanup
      work on its own node.
      4.1) JBC on A sends out an RPC call with its address as an arg to a new
      "flushReady()" method added to TreeCache. (Other name for method is
      fine.)
      4.2) JBC on A blocks waiting for flushReady() RPC calls from all the
      other members. Does not return from block().
      5) A gets the prepare() (no problem, FLUSH doesn't block up messages,
      just down messages.)
      6) A executes the prepare(), can send the response to B because FLUSH
      isn't blocking the channel.
      7) B gets the START_FLUSH, calls block().
      8) JBC B doesn't immediately return from block() as it detects it has a
      2PC in progress and is giving the prepare() some time to complete (avoid
      unnecessary tx rollback).
      9) JBC B receives flushReady() call from A, adds entry to a vector
      recording A is ready.
      10) B receives prepare() response from A, sends commit().
      11) B sends out RPC call with its address to the "flushReady()" method.
      12) A receives commit(), commits tx.
      13) A receives flushReady() call from B. Adds entry to a vector
      recording that B is ready.
      14) A sees that all other nodes are ready, returns from block().
      15) B sees that all other nodes are ready, returns from block().

      Downside to this is complexity and requirement to add another method for
      the "flushReady()" RPC.
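      The application-level coordination in alternative C can be sketched like this (class name and method signatures are hypothetical, mirroring the "flushReady()" RPC proposed above): block() does not return until every other member has announced readiness, so the channel itself never has to block while 2PC traffic is still in flight.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of alternative C: application-level readiness
// coordination instead of channel-level blocking (hypothetical names).
public class FlushReadyCoordinator {
    private final List<String> otherMembers;
    private final List<String> readyMembers = new ArrayList<>(); // the "vector"

    public FlushReadyCoordinator(List<String> otherMembers) {
        this.otherMembers = otherMembers;
    }

    // Invoked remotely by each member, as in steps 4.1 and 11 above.
    public synchronized void flushReady(String memberAddress) {
        if (!readyMembers.contains(memberAddress))
            readyMembers.add(memberAddress);
        notifyAll();
    }

    public synchronized boolean allReady() {
        return readyMembers.containsAll(otherMembers);
    }

    // Called from inside block(): waits until every other member is ready.
    public synchronized void awaitAllReady() throws InterruptedException {
        while (!allReady())
            wait();
    }

    public static void main(String[] args) throws InterruptedException {
        FlushReadyCoordinator coord = new FlushReadyCoordinator(List.of("B"));
        // Simulate B's flushReady() RPC arriving from another thread.
        new Thread(() -> coord.flushReady("B")).start();
        coord.awaitAllReady(); // block() may now return
        System.out.println("all members ready");
    }
}
```

Because the waiting happens inside block() rather than in FLUSH.down(), the prepare()/commit() responses in steps 5 through 12 are never held up.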

      D) A 3rd alternative is to just accept the problem. The problem is a
      race condition where A blocks down events but then receives a prepare().
      Its response to prepare() cannot be sent. The effect is JBC B's impl of
      FLUSH will detect the prepare() isn't progressing and at some point roll
      back the tx. This will result in a rollback() message being sent to A.
      A can receive it and roll back the tx. IIRC a rollback() is always
      async, so A does not need to send a response. A and B end up in a valid
      state.

      Downside of this is the tx gets rolled back. This could be a frequent
      occurrence in high-load scenarios, because a new node in the cluster
      could be expected to return from block() very quickly (and thus send its
      FLUSH_OK), possibly even before the START_FLUSH message goes out on the
      wire.

      Brian Stansberry
      Lead, AS Clustering
      JBoss, a division of Red Hat

              vblagoje Vladimir Blagojevic (Inactive)
              rhn-engineering-bban Bela Ban
              Votes: 0
              Watchers: 0

                Created:
                Updated:
                Resolved: