- Bug
- Resolution: Done
- Major
- 6.0.2.Final
- None
In a stress test that repeatedly kills nodes while performing read/write operations, the TransferQueueBundler thread seems to spend a lot of time waiting for physical addresses:
06:40:10,316 WARN [org.radargun.utils.Utils] (pool-5-thread-1) Stack for thread TransferQueueBundler,default,apex953-14666:
    java.lang.Thread.sleep(Native Method)
    org.jgroups.util.Util.sleep(Util.java:1504)
    org.jgroups.util.Util.sleepRandom(Util.java:1574)
    org.jgroups.protocols.TP.sendToSingleMember(TP.java:1685)
    org.jgroups.protocols.TP.doSend(TP.java:1670)
    org.jgroups.protocols.TP$TransferQueueBundler.sendBundledMessages(TP.java:2476)
    org.jgroups.protocols.TP$TransferQueueBundler.sendMessages(TP.java:2392)
    org.jgroups.protocols.TP$TransferQueueBundler.run(TP.java:2383)
    java.lang.Thread.run(Thread.java:744)
Two related bugs are already fixed in JGroups 3.5.0.Beta2 and later: JGRP-1814 and JGRP-1815.
There is also a special case, JGRP-1858, where the physical address can be removed from the cache too soon, which exacerbates the effect of JGRP-1815.
We can work around the problem by changing the JGroups configuration:
- TP.logical_addr_cache_expiration=86400000
  - Only expire addresses after 1 day
- TP.physical_addr_max_fetch_attempts=1
  - Sleep for only 20ms waiting for the physical address (the default of 3 attempts waits up to 1500ms)
- UNICAST3.conn_close_timeout=10000
  - Drop the pending messages to leavers sooner
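For reference, the workaround settings above might look roughly like this in a JGroups XML stack. This is only a sketch: it assumes a UDP transport (the TP attributes are set on whichever concrete transport the stack uses), and the rest of the protocol stack is elided and must be kept from your existing configuration.

```xml
<config xmlns="urn:org:jgroups">
    <!-- Assumption: UDP is the transport; TP attributes go on the concrete transport element -->
    <UDP logical_addr_cache_expiration="86400000"
         physical_addr_max_fetch_attempts="1"/>

    <!-- other protocols of the existing stack stay here, unchanged -->

    <!-- Drop the pending messages to leavers sooner -->
    <UNICAST3 conn_close_timeout="10000"/>
</config>
```

Only the three attributes shown are part of the workaround; any other attributes already present on these elements should be left as they are.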