After much testing and analysis (and reopening and fixing ISPN-865), the final issue here is that certain transactions throw an IllegalStateException in commit(), which cascades into a series of problems.
See http://lists.jboss.org/pipermail/infinispan-dev/2011-January/007320.html for a more detailed discussion.
There are two scenarios we're seeing on rehashing, both of which are critical.
1. When a node leaves a running cluster, we're seeing an inordinate number of timeout errors, such as the one below. The end result is that the cluster loses data.
org.infinispan.util.concurrent.TimeoutException: Timed out waiting for valid responses!
06:07:44,097 WARN [GMS] cms-node-20192: merge leader did not get data from all partition coordinators [cms-node-20192, mydht1-18445], merge is cancelled
    at org.infinispan.commands.read.GetKeyValueCommand.acceptVisitor(GetKeyValueCommand.java:59)
2. Joining a node into a running cluster causes transactional failures on the other nodes. Depending on the load, a node usually takes upwards of 8 minutes to join.
I've attached a unit test that can reproduce these issues.