[submitted by Oleg Afanasiev (xbalanque at mail.ru)]
I found the following deadlock in protocol FD:
Found one Java-level deadlock:
=============================
"IncomingPacketHandler (channel=+slee_5servers)":
waiting to lock monitor 0x0000002b80fcd488 (object 0x0000002aa7788330, a java.lang.Object),
which is held by "Timer-4"
"Timer-4":
waiting to lock monitor 0x0000002b80fcd848 (object 0x0000002aa778adc8, a java.util.Vector),
which is held by "IncomingPacketHandler (channel=+slee_5servers)"
Java stack information for the threads listed above:
===================================================
"IncomingPacketHandler (channel=+slee_5servers)":
at org.jgroups.protocols.FD$Broadcaster.stopBroadcastTask (FD.java:602)
- waiting to lock <0x0000002aa7788330> (a java.lang.Object)
at org.jgroups.protocols.FD$Broadcaster.adjustSuspectedMembers (FD.java:649) - locked <0x0000002aa778adc8> (a java.util.Vector)
at org.jgroups.protocols.FD.down(FD.java:318) - locked <0x0000002aa7083c08> (a org.jgroups.protocols.FD)
at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java: 499)
at org.jgroups.stack.Protocol.passDown(Protocol.java:533)
at org.jgroups.stack.Protocol.down(Protocol.java:559)
at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java: 499)
at org.jgroups.stack.Protocol.passDown(Protocol.java:533)
at org.jgroups.protocols.pbcast.NAKACK.down(NAKACK.java:480)
at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java: 499)
at org.jgroups.stack.Protocol.passDown(Protocol.java:533)
at org.jgroups.protocols.UNICAST.down(UNICAST.java:391)
at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java: 499)
at org.jgroups.stack.Protocol.passDown(Protocol.java:533)
at org.jgroups.protocols.pbcast.STABLE.down(STABLE.java:287)
at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java: 499)
at org.jgroups.stack.Protocol.passDown(Protocol.java:533)
at org.jgroups.protocols.VIEW_SYNC.down(VIEW_SYNC.java:166)
at org.jgroups.stack.Protocol.receiveDownEvent(Protocol.java: 499)
at org.jgroups.stack.Protocol.passDown(Protocol.java:533)
at org.jgroups.protocols.pbcast.GMS.installView(GMS.java:509) - locked <0x0000002aa76ee7e0> (a org.jgroups.Membership)
at org.jgroups.protocols.pbcast.GMS.installView(GMS.java:422)
at org.jgroups.protocols.pbcast.CoordGmsImpl.handleViewChange (CoordGmsImpl.java:568)
...
"Timer-4":
at java.util.Vector.clone(Vector.java:638)
- waiting to lock <0x0000002aa778adc8> (a java.util.Vector)
at org.jgroups.protocols.FD$Broadcaster.startBroadcastTask (FD.java:587) - locked <0x0000002aa7788330> (a java.lang.Object)
at org.jgroups.protocols.FD$Broadcaster.addSuspectedMember (FD.java:621)
at org.jgroups.protocols.FD$Monitor.run(FD.java:541)
at org.jgroups.util.TimeScheduler$TaskWrapper.run (TimeScheduler.java:204)
at java.util.TimerThread.mainLoop(Timer.java:512)
at java.util.TimerThread.run(Timer.java:462)
The source of deadlock is the following piece of code:
If addSuspectedMember from own timer task and adjustSuspectedMembers
from GMS arrive simultaneously, we've got a deadlock on method clone().
protected final class Broadcaster {
final Vector suspected_mbrs=new Vector(7);
BroadcastTask task=null;
private final Object bcast_mutex=new Object();
Vector getSuspectedMembers()
{ return suspected_mbrs; }/**
- Starts a new task, or - if already running - adds the
argument to the running task. - @param suspect
*/
private void startBroadcastTask(Address suspect) {
synchronized(bcast_mutex)Unknown macro: { if(task == null || task.cancelled()) { task=new BroadcastTask((Vector) suspected_mbrs.clone()); /// this clone method is synchronized and causes a deadlock task.addSuspectedMember(suspect); task.run(); // run immediately the first time timer.add(task); // then every timeout milliseconds, until cancelled if(log.isTraceEnabled()) log.trace("BroadcastTask started"); } else { task.addSuspectedMember(suspect); } }}
private void stopBroadcastTask() {
synchronized(bcast_mutex) {
if(task != null)
}
}
/** Adds a suspected member. Starts the task if not yet
running */
protected void addSuspectedMember(Address mbr) {
if(mbr == null) return;
if(!members.contains(mbr)) return;
boolean added=false;
synchronized(suspected_mbrs) {
if(!suspected_mbrs.contains(mbr))
}
if(added)
startBroadcastTask(mbr);
}
void removeSuspectedMember(Address suspected_mbr) {
if(suspected_mbr == null) return;
if(log.isDebugEnabled()) log.debug("member is " +
suspected_mbr);
synchronized(suspected_mbrs)
}
void removeAll() {
synchronized(suspected_mbrs)
}
/** Removes all elements from suspected_mbrs that are
<em>not</em> in the new membership */
void adjustSuspectedMembers(List new_mbrship) {
if(new_mbrship == null || new_mbrship.isEmpty()) return;
StringBuffer sb=new StringBuffer();
synchronized(suspected_mbrs)
}
}
The proposed solution is to move startBroadcastTask() back inside
synchronized(suspected_mbrs)
block in the addSuspectedMember method. So it'll look like:
/** Adds a suspected member. Starts the task if not yet
running */
protected void addSuspectedMember(Address mbr) {
if(mbr == null) return;
if(!members.contains(mbr)) return;
boolean added=false;
synchronized(suspected_mbrs) {
if(!suspected_mbrs.contains(mbr))
}
}
Or we may use a separate monitor for operations that use
suspected_mbrs to synchronize.