JGroups / JGRP-1282

Race condition in FLUSH when master leaves cluster


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 2.6.19, 2.11.2, 2.12
    • Affects Version/s: 2.6.16
    • None

      There's a race condition in FLUSH when the master node is leaving the cluster,
      which can cause the master to leave without first sending a new view (with a new master).

      The FLUSH is started when GMS sends down an Event.SUSPEND.
      FLUSH.down calls FLUSH.startFlush, which calls FLUSH.onSuspend.
      onSuspend sends a START_FLUSH message down.
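      A minimal sketch of the suspend path just described, assuming a heavily
      simplified protocol class. SuspendPathModel, its nested enums and its method
      signatures are hypothetical stand-ins, not the real JGroups FLUSH code; the
      model only records the call order and the message handed down the stack.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified model of the SUSPEND call chain.
// Not the real JGroups FLUSH code: types and signatures are assumptions.
final class SuspendPathModel {
    enum EventType { SUSPEND }
    enum MsgType { START_FLUSH }

    // Messages handed to the layer below, in send order.
    final List<MsgType> sentDown = new ArrayList<>();
    // Methods visited, in call order, so the chain can be inspected.
    final List<String> trace = new ArrayList<>();

    // GMS starts a flush by passing SUSPEND down the protocol stack.
    void down(EventType evt) {
        trace.add("down");
        if (evt == EventType.SUSPEND) {
            startFlush();
        }
    }

    private void startFlush() {
        trace.add("startFlush");
        onSuspend();
    }

    private void onSuspend() {
        trace.add("onSuspend");
        // Stand-in for multicasting START_FLUSH to the flush participants.
        sentDown.add(MsgType.START_FLUSH);
    }
}
```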

      In the working case, the local node gets the START_FLUSH first.
      FLUSH.up calls FLUSH.handleStartFlush, which calls FLUSH.onStartFlush.
      onStartFlush sets the member variable "flushMembers".

      Then the other nodes reply to the START_FLUSH with a FLUSH_COMPLETED.
      FLUSH.up calls FLUSH.onFlushCompleted.
      onFlushCompleted checks "flushMembers" against the list of replies.
      If they match (and flushMembers is not null), the flush completes.
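      The completion check in the working case can be modelled roughly as below.
      This is a hypothetical sketch: members are plain strings rather than
      addresses, FlushCompletionModel and isFlushDone are invented names, and
      "match" is interpreted as every entry in flushMembers having replied. The
      real protocol's data structures and comparison may differ.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of the completion check described above, not JGroups code.
final class FlushCompletionModel {
    private List<String> flushMembers;               // null until the local START_FLUSH is handled
    private final Set<String> replies = new HashSet<>();
    private boolean flushDone;

    // Local delivery of START_FLUSH: remember which members must reply.
    void onStartFlush(List<String> members) {
        flushMembers = members;
    }

    // A FLUSH_COMPLETED reply arrived from 'sender'.
    void onFlushCompleted(String sender) {
        replies.add(sender);
        // Complete only when flushMembers is known and every member has replied.
        if (flushMembers != null && replies.containsAll(flushMembers)) {
            flushDone = true;
        }
    }

    boolean isFlushDone() { return flushDone; }
}
```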

      But in the non-working case, the FLUSH_COMPLETED replies from the other
      nodes are processed before the local START_FLUSH.
      In this case, flushMembers has not been set, and onFlushCompleted
      does nothing, expecting more replies (which never come).

      I believe this will only be triggered when the master is leaving,
      because it does not include itself in the FLUSH. If it were a flush
      member, there would be a FLUSH_COMPLETED reply from itself to
      trigger setting flushMembers at some point.
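      To make the ordering problem concrete, the hypothetical single-node
      simulation below replays the racy order (remote FLUSH_COMPLETED replies
      handled before the local START_FLUSH), once with the leaving master excluded
      from the flush members and once with it included. It reuses the report's
      method names, but the signatures, the direct call standing in for the node's
      own FLUSH_COMPLETED reply, and the runRacyOrder helper are assumptions for
      illustration only.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical single-node simulation of the race; not JGroups internals.
final class FlushRaceSimulation {
    private final String self;
    private List<String> flushMembers;               // set only by the local START_FLUSH
    private final Set<String> replies = new HashSet<>();
    private boolean flushDone;

    FlushRaceSimulation(String self) { this.self = self; }

    void onStartFlush(List<String> members) {
        flushMembers = members;
        // Assumption: a node that is itself a flush member answers its own
        // START_FLUSH, which re-runs the check now that flushMembers is set.
        if (members.contains(self)) {
            onFlushCompleted(self);
        }
    }

    void onFlushCompleted(String sender) {
        replies.add(sender);
        // The check lives only here, so it is not re-run when flushMembers is set.
        if (flushMembers != null && replies.containsAll(flushMembers)) {
            flushDone = true;
        }
    }

    boolean isFlushDone() { return flushDone; }

    // Racy order: every remote reply is processed before the local START_FLUSH.
    // Returns whether the flush ever completes.
    static boolean runRacyOrder(String self, List<String> members, List<String> remotes) {
        FlushRaceSimulation node = new FlushRaceSimulation(self);
        for (String r : remotes) {
            node.onFlushCompleted(r);                // flushMembers still null: check skipped
        }
        node.onStartFlush(members);                  // no further remote replies will arrive
        return node.isFlushDone();
    }
}
```

      With the leaving master A excluded (members B and C), runRacyOrder returns
      false even though both replies were received. Including A lets its own reply
      re-run the check, and the call returns true.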

              Assignee: Vladimir Blagojevic (Inactive)
              Reporter: Dennis Reed
              Votes: 0
              Watchers: 1
