Uploaded image for project: 'JGroups'
  1. JGroups
  2. JGRP-553

Deadlock in FD protocol

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Major Major
    • 2.4.1 SP4, 2.5
    • None
    • None

      [submitted by Oleg Afanasiev (xbalanque at mail.ru)]

      I found the following deadlock in protocol FD:

      Found one Java-level deadlock:
      =============================
      "IncomingPacketHandler (channel=+slee_5servers)":
      waiting to lock monitor 0x0000002b80fcd488 (object 0x0000002aa7788330, a java.lang.Object),
      which is held by "Timer-4"

      "Timer-4":
      waiting to lock monitor 0x0000002b80fcd848 (object 0x0000002aa778adc8, a java.util.Vector),
      which is held by "IncomingPacketHandler (channel=+slee_5servers)"

      Java stack information for the threads listed above:
      ===================================================
      "IncomingPacketHandler (channel=+slee_5servers)":
      at org.jgroups.protocols.FD$Broadcaster.stopBroadcastTask (FD.java:602)

      • waiting to lock <0x0000002aa7788330> (a java.lang.Object)
        at org.jgroups.protocols.FD$Broadcaster.adjustSuspectedMembers (FD.java:649)
      • locked <0x0000002aa778adc8> (a java.util.Vector)
        at org.jgroups.protocols.FD.down(FD.java:318)
      • locked <0x0000002aa7083c08> (a org.jgroups.protocols.FD)
        at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java: 499)
        at org.jgroups.stack.Protocol.passDown(Protocol.java:533)
        at org.jgroups.stack.Protocol.down(Protocol.java:559)
        at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java: 499)
        at org.jgroups.stack.Protocol.passDown(Protocol.java:533)
        at org.jgroups.protocols.pbcast.NAKACK.down(NAKACK.java:480)
        at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java: 499)
        at org.jgroups.stack.Protocol.passDown(Protocol.java:533)
        at org.jgroups.protocols.UNICAST.down(UNICAST.java:391)
        at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java: 499)
        at org.jgroups.stack.Protocol.passDown(Protocol.java:533)
        at org.jgroups.protocols.pbcast.STABLE.down(STABLE.java:287)
        at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java: 499)
        at org.jgroups.stack.Protocol.passDown(Protocol.java:533)
        at org.jgroups.protocols.VIEW_SYNC.down(VIEW_SYNC.java:166)
        at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java: 499)
        at org.jgroups.stack.Protocol.passDown(Protocol.java:533)
        at org.jgroups.protocols.pbcast.GMS.installView(GMS.java:509)
      • locked <0x0000002aa76ee7e0> (a org.jgroups.Membership)
        at org.jgroups.protocols.pbcast.GMS.installView(GMS.java:422)
        at org.jgroups.protocols.pbcast.CoordGmsImpl.handleViewChange (CoordGmsImpl.java:568)
        ...

      "Timer-4":
      at java.util.Vector.clone(Vector.java:638)

      • waiting to lock <0x0000002aa778adc8> (a java.util.Vector)
        at org.jgroups.protocols.FD$Broadcaster.startBroadcastTask (FD.java:587)
      • locked <0x0000002aa7788330> (a java.lang.Object)
        at org.jgroups.protocols.FD$Broadcaster.addSuspectedMember (FD.java:621)
        at org.jgroups.protocols.FD$Monitor.run(FD.java:541)
        at org.jgroups.util.TimeScheduler$TaskWrapper.run (TimeScheduler.java:204)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)

      The source of deadlock is the following piece of code:
      If addSuspectedMember from own timer task and adjustSuspectedMembers
      from GMS arrive simultaneously, we've got a deadlock on method clone().

      protected final class Broadcaster {
      final Vector suspected_mbrs=new Vector(7);
      BroadcastTask task=null;
      private final Object bcast_mutex=new Object();

      Vector getSuspectedMembers()

      { return suspected_mbrs; }

      /**

      • Starts a new task, or - if already running - adds the
        argument to the running task.
      • @param suspect
        */
        private void startBroadcastTask(Address suspect) {
        synchronized(bcast_mutex)
        Unknown macro: { if(task == null || task.cancelled()) { task=new BroadcastTask((Vector) suspected_mbrs.clone()); /// this clone method is synchronized and causes a deadlock task.addSuspectedMember(suspect); task.run(); // run immediately the first time timer.add(task); // then every timeout milliseconds, until cancelled if(log.isTraceEnabled()) log.trace("BroadcastTask started"); } else { task.addSuspectedMember(suspect); } }

        }

      private void stopBroadcastTask() {
      synchronized(bcast_mutex) {
      if(task != null)

      { task.stop(); task=null; }

      }
      }

      /** Adds a suspected member. Starts the task if not yet
      running */
      protected void addSuspectedMember(Address mbr) {
      if(mbr == null) return;
      if(!members.contains(mbr)) return;
      boolean added=false;
      synchronized(suspected_mbrs) {
      if(!suspected_mbrs.contains(mbr))

      { suspected_mbrs.addElement(mbr); added=true; }

      }
      if(added)
      startBroadcastTask(mbr);
      }

      void removeSuspectedMember(Address suspected_mbr) {
      if(suspected_mbr == null) return;
      if(log.isDebugEnabled()) log.debug("member is " +
      suspected_mbr);
      synchronized(suspected_mbrs)

      { suspected_mbrs.removeElement(suspected_mbr); if(suspected_mbrs.isEmpty()) stopBroadcastTask(); }

      }

      void removeAll() {
      synchronized(suspected_mbrs)

      { suspected_mbrs.removeAllElements(); stopBroadcastTask(); }

      }

      /** Removes all elements from suspected_mbrs that are
      <em>not</em> in the new membership */
      void adjustSuspectedMembers(List new_mbrship) {
      if(new_mbrship == null || new_mbrship.isEmpty()) return;
      StringBuffer sb=new StringBuffer();
      synchronized(suspected_mbrs)

      { sb.append("suspected_mbrs: ").append(suspected_mbrs); suspected_mbrs.retainAll(new_mbrship); if(suspected_mbrs.isEmpty()) stopBroadcastTask(); sb.append(", after adjustment: ").append (suspected_mbrs); log.debug(sb.toString()); }

      }
      }

      The proposed solution is to move startBroadcastTask() back inside
      synchronized(suspected_mbrs)
      block in the addSuspectedMember method. So it'll look like:

      /** Adds a suspected member. Starts the task if not yet
      running */
      protected void addSuspectedMember(Address mbr) {
      if(mbr == null) return;
      if(!members.contains(mbr)) return;
      boolean added=false;
      synchronized(suspected_mbrs) {
      if(!suspected_mbrs.contains(mbr))

      { suspected_mbrs.addElement(mbr); startBroadcastTask(mbr); }

      }
      }

      Or we may use a separate monitor for operations that use
      suspected_mbrs to synchronize.

              rhn-engineering-bban Bela Ban
              rhn-engineering-bban Bela Ban
              Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

                Created:
                Updated:
                Resolved: