  1. JGroups
  2. JGRP-335

Hangs with FLUSH

Details

    • Bug
    • Resolution: Done
    • Blocker
    • 2.4
    • 2.3 SP1
    • None
    • High

    Description

      There are two use cases in which we can run into the problem (members A and B).

      #1 View change

      • A is running, B joins
      • B is not blocking in FLUSH, A is blocking after START_FLUSH
      • A starts the flush
      • A returns the new view to B in a JOIN_RSP. This causes B's Channel.connect() to return
      • B sends a unicast message (a STATE_REQ) to A, and A sends its response on the same thread that is servicing the request
      • A completes the flush, multicasting a STOP_FLUSH message
      • The STATE_REQ at A hangs on FLUSH.down()
      • The STOP_FLUSH at A can never unblock FLUSH.down(), because it was received after the STATE_REQ from B and is stuck behind it!
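
      The sequence above can be reproduced with a toy model. The sketch below is not JGroups code: FlushHangDemo and requestUnblocked are hypothetical names, a CountDownLatch stands in for the gate in FLUSH.down(), and a single-threaded executor stands in for the one thread that delivers messages at A. A timeout is used so that the demo terminates instead of hanging.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Toy model of member A's delivery path (hypothetical, not JGroups code). */
final class FlushHangDemo {

    /**
     * Delivers a STATE_REQ followed by a STOP_FLUSH on one delivery thread.
     * Returns true if the STATE_REQ got through the gate before the timeout.
     */
    static boolean requestUnblocked(long timeoutMillis) throws Exception {
        // Stand-in for the gate in FLUSH.down(): opened by STOP_FLUSH.
        CountDownLatch flushStopped = new CountDownLatch(1);
        ExecutorService deliveryThread = Executors.newSingleThreadExecutor();
        try {
            // The STATE_REQ arrives first; its response has to wait at the gate.
            Callable<Boolean> stateReq =
                    () -> flushStopped.await(timeoutMillis, TimeUnit.MILLISECONDS);
            Future<Boolean> result = deliveryThread.submit(stateReq);

            // STOP_FLUSH arrives second and is queued behind the STATE_REQ,
            // so it cannot run until the STATE_REQ gives up.
            deliveryThread.execute(flushStopped::countDown);

            return result.get();
        } finally {
            deliveryThread.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("STATE_REQ unblocked: " + requestUnblocked(500));
    }
}
```

      In this model the STATE_REQ is never unblocked within the timeout, because the only thread that could deliver the STOP_FLUSH is the one waiting for it.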

      SOLUTION:

      1. Make B block in FLUSH.down() as soon as the client sends the JOIN_REQ to A
      2. Make STOP_FLUSH synchronous. This means we only return from Channel.connect() (for example) once every member has acked the STOP_FLUSH. See use case #2 (state transfer) for a description of what happens if we don't do this.
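
      A minimal sketch of the ack collection that a synchronous STOP_FLUSH implies, assuming members are identified by strings. StopFlushAcks and its methods are hypothetical names for illustration and do not reflect the actual FLUSH implementation.

```java
import java.util.Collection;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/** Tracks which members still owe an ack for a STOP_FLUSH (hypothetical helper). */
final class StopFlushAcks {
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final CountDownLatch allAcked;

    StopFlushAcks(Collection<String> members) {
        pending.addAll(members);
        // One count per distinct member that has to ack.
        allAcked = new CountDownLatch(pending.size());
    }

    /** Records an ack; duplicate or unknown acks are ignored. */
    void ack(String member) {
        if (pending.remove(member)) {
            allAcked.countDown();
        }
    }

    /** Blocks until every member has acked; returns false on timeout. */
    boolean awaitAll(long timeoutMillis) throws InterruptedException {
        return allAcked.await(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

      A caller such as connect() or getState() would return only after awaitAll(...) reports true, so no member can send a message that overtakes a STOP_FLUSH which has not yet been processed everywhere.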

      #2 State transfer

      • A and B are members of the group
      • B calls Channel.getState()
      • A and B receive a START_FLUSH and start blocking in FLUSH
      • State is transferred from A to B
      • B multicasts a STOP_FLUSH and immediately afterwards sends a unicast message (which can overtake multicast messages, as the two are unrelated)
      • A happens to receive the unicast message before the STOP_FLUSH. The unicast blocks and the STOP_FLUSH, which would unblock it, cannot be delivered

      SOLUTION:

      1. Same as solution 2 above. If we make the STOP_FLUSH phase synchronous, connect() or getState() only return once everyone has been unblocked

      LONG TERM SOLUTION:

      • The much better solution, of course, is to make the STOP_FLUSH message out-of-band, so that it can be delivered in parallel with other messages and is not blocked by (e.g.) the unicast in the queue. Even if the unicast message is blocked waiting for STOP_FLUSH, the STOP_FLUSH is delivered as soon as it is received, which causes the unicast to unblock
      • Once we have this solution in place (2.5, threadless stack and out-of-band messages), we can revert the STOP_FLUSH to only use 1 phase rather than 2
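
      The effect of out-of-band delivery can be illustrated with a toy model. The sketch below is not JGroups code and does not use the JGroups API: OobStopFlushDemo is a hypothetical name, a CountDownLatch stands in for the gate in FLUSH, and two executors stand in for the regular delivery thread and a separate out-of-band delivery thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

/** Toy model of out-of-band STOP_FLUSH delivery (hypothetical, not JGroups code). */
final class OobStopFlushDemo {

    /**
     * Delivers a unicast on the regular thread and STOP_FLUSH on a separate
     * out-of-band thread. Returns true if the unicast was unblocked in time.
     */
    static boolean requestUnblocked(long timeoutMillis) throws Exception {
        // Stand-in for the gate in FLUSH: opened by STOP_FLUSH.
        CountDownLatch flushStopped = new CountDownLatch(1);
        ExecutorService regular = Executors.newSingleThreadExecutor();
        ExecutorService outOfBand = Executors.newSingleThreadExecutor();
        try {
            // The unicast is delivered first and waits at the gate.
            Callable<Boolean> unicast =
                    () -> flushStopped.await(timeoutMillis, TimeUnit.MILLISECONDS);
            Future<Boolean> result = regular.submit(unicast);

            // STOP_FLUSH does not queue behind the unicast: it runs on its
            // own thread and opens the gate.
            outOfBand.execute(flushStopped::countDown);

            return result.get();
        } finally {
            regular.shutdownNow();
            outOfBand.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("unicast unblocked: " + requestUnblocked(2000));
    }
}
```

      Because the STOP_FLUSH no longer shares a queue with the blocked unicast, the order in which the two messages arrive stops mattering in this model.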

            People

              vblagoje Vladimir Blagojevic (Inactive)
              rhn-engineering-bban Bela Ban