Uploaded image for project: 'JGroups'
  1. JGroups
  2. JGRP-1229

Deadlock during flush

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Major Major
    • 2.11
    • 2.10
    • None

      In my environment a deadlock during flush occurs under following circumstances: A message is processed wich sends another multicast message. During the processing of the message GMS starts a flush. The sending of the multicast message blocks and the flush fails because the processed message doesn't end. The ABORT_FLUSH message isn't processed because it's not a OOB message. The blocked message never unblocks.

      The following Threaddump shows the GMS thread and the message processing thread:

      "ViewHandler,Test,#A" prio=6 tid=0x04015c00 nid=0x1e8 waiting on condition [0x04b0f000]
      java.lang.Thread.State: TIMED_WAITING (sleeping)
      at java.lang.Thread.sleep(Native Method)
      at org.jgroups.util.Util.sleep(Util.java:1298)
      at org.jgroups.util.Util.sleepRandom(Util.java:1377)
      at org.jgroups.protocols.pbcast.GMS._startFlush(GMS.java:721)
      at org.jgroups.protocols.pbcast.GMS.startFlush(GMS.java:694)
      at org.jgroups.protocols.pbcast.CoordGmsImpl.handleMembershipChange(CoordGmsImpl.java:189)
      at org.jgroups.protocols.pbcast.GMS$ViewHandler.process(GMS.java:1390)
      at org.jgroups.protocols.pbcast.GMS$ViewHandler.run(GMS.java:1344)
      at java.lang.Thread.run(Thread.java:619)

      "Incoming-4,Test,#A" prio=6 tid=0x04014000 nid=0x11f4 waiting on condition [0x0497e000]
      java.lang.Thread.State: TIMED_WAITING (parking)
      at sun.misc.Unsafe.park(Native Method)

      • parking to wait for <0x248b2cd8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2054)
        at org.jgroups.protocols.pbcast.FLUSH.blockMessageDuringFlush(FLUSH.java:321)
        at org.jgroups.protocols.pbcast.FLUSH.down(FLUSH.java:254)
        at org.jgroups.stack.ProtocolStack.down(ProtocolStack.java:894)
        at org.jgroups.JChannel.down(JChannel.java:1623)
        at org.jgroups.JChannel.send(JChannel.java:724)
        ... (Usercode)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.handleUpEvent(MessageDispatcher.java:640)
        at org.jgroups.blocks.MessageDispatcher$ProtocolAdapter.up(MessageDispatcher.java:772)
        at org.jgroups.JChannel.up(JChannel.java:1453)
        at org.jgroups.stack.ProtocolStack.up(ProtocolStack.java:887)
        at org.jgroups.protocols.pbcast.FLUSH.up(FLUSH.java:435)
        at org.jgroups.protocols.pbcast.STATE_TRANSFER.up(STATE_TRANSFER.java:151)
        at org.jgroups.protocols.FRAG2.up(FRAG2.java:188)
        at org.jgroups.protocols.FC.up(FC.java:474)
        at org.jgroups.protocols.pbcast.GMS.up(GMS.java:888)
        at org.jgroups.protocols.pbcast.STABLE.up(STABLE.java:234)
        at org.jgroups.protocols.UNICAST.handleDataReceived(UNICAST.java:614)
        at org.jgroups.protocols.UNICAST.up(UNICAST.java:294)
        at org.jgroups.protocols.pbcast.NAKACK.up(NAKACK.java:707)
        at org.jgroups.protocols.VERIFY_SUSPECT.up(VERIFY_SUSPECT.java:132)
        at org.jgroups.protocols.FD.up(FD.java:266)
        at org.jgroups.protocols.MERGE2.up(MERGE2.java:210)
        at org.jgroups.protocols.Discovery.up(Discovery.java:281)
        at org.jgroups.protocols.TP.passMessageUp(TP.java:1009)
        at org.jgroups.protocols.TP.access$100(TP.java:56)
        at org.jgroups.protocols.TP$3.run(TP.java:933)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

      DEBUG output:

      19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
      FEIN: new=C, suspected=[], leaving=[], new view: [#A|1] A, #C
      19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
      FEIN: #A: flush coordinator is starting FLUSH with participants A
      19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
      FEIN: #A: received START_FLUSH, responded with FLUSH_COMPLETED to #A
      19.08.2010 13:47:41 org.jgroups.logging.JDKLogImpl debug
      FEIN: #A: blocking for 10000ms
      19.08.2010 13:47:43 org.jgroups.logging.JDKLogImpl debug
      FEIN: #A: timed out waiting for flush responses after 2000 ms. Rejecting flush to participants A
      19.08.2010 13:47:51 org.jgroups.logging.JDKLogImpl debug
      FEIN: #A: blocking for 10000ms
      19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl warn
      WARNUNG: #A: GMS flush by coordinator failed
      19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl debug
      FEIN: resuming message garbage collection
      19.08.2010 13:47:55 org.jgroups.logging.JDKLogImpl debug
      FEIN: new=B, suspected=[], leaving=[], new view: [#A|2] A, #C, #B
      19.08.2010 13:48:01 org.jgroups.logging.JDKLogImpl debug
      FEIN: #A: blocking for 10000ms
      19.08.2010 13:48:06 org.jgroups.logging.JDKLogImpl warn
      WARNUNG: #A: GMS flush by coordinator failed
      ...

      My solution (that works in my environment, but I can't overlook if it's correct):

      Changes to org.jgroups.protocols.pbcast.FLUSH

      // 1. set the OOB flag for the ABORT_FLUSH message
      552 private void rejectFlush(Collection<? extends Address> participants, long viewId) {
      553 for (Address flushMember : participants) {
      554 Message reject = new Message(flushMember, localAddress, null);
      555! reject.setFlag(Message.OOB);

      // 2. the processing of the ABORT_FLUSH message has at least to set "isBlockingFlushDown = false"
      // but I copied the complete implementation from the "onStopFlush" method
      336 public Object up(Event evt) {
      ...
      368 case FlushHeader.ABORT_FLUSH:
      ...
      379! synchronized (sharedLock)

      { 380! flushCompletedMap.clear(); 381! flushNotCompletedMap.clear(); 382! flushMembers.clear(); 383! suspected.clear(); 384! flushCoordinator = null; 385! flushCompleted = false; 386! }

      387! blockMutex.lock();
      388! try

      { 389! isBlockingFlushDown = false; 390! notBlockedDown.signalAll(); 391! }

      finally

      { 392! blockMutex.unlock(); 393! }

      394! flushInProgress.set(false);

              vblagoje Vladimir Blagojevic (Inactive)
              hamphh Markus Hampel (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

                Created:
                Updated:
                Resolved: