-
Bug
-
Resolution: Done
-
Major
-
5.3.0.Final
-
None
Problem
It is possible to get the node, which would be permanently locked, when stressed with multiple concurrent writes and, perhaps, unfortunate timing. For more details and a detailed explanation as to how to reproduce the issue, please refer to the following post:
As suggested by hchiorean, once the MODE-2670 was resolved, I pulled in the latest code and gave the aforementioned test case another try. The results were surprising. Here is a comparison between what I saw before and what I see now.
Before:
- A few runs of the provided test case, using the same DB between runs, could result in the corruption of the originally locked node, where multiple attempts from multiple members of the cluster, performed after configured lock expiration, could not unlock the node.
- When corruption happens in the previous step, an attempt to run the same test again will fail right away for various underlying reasons, e.g. NullPointerException, DocumentAlreadyExistsException. Note that all members of the cluster could be stood up for actual test to begin, i.e. there were no problems getting members online.
Now:
- Every run of the test case hangs after ~ 20 seconds, i.e. there are some successful/failed actions taking place for some time, but after that everything freezes. The thread dump reveals plenty of threads waiting on a condition to happen, for instance:
"modeshape-event-dispatcher-17-thread-1" #154 prio=5 os_prio=31 tid=0x00007fb8d349d000 nid=0x14d07 waiting on condition [0x000000012bb64000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x00000007710e3068> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at org.modeshape.common.collection.ring.BlockingWaitStrategy.waitFor(BlockingWaitStrategy.java:46) at org.modeshape.common.collection.ring.GarbageCollectingConsumer$1.waitFor(GarbageCollectingConsumer.java:61) at org.modeshape.common.collection.ring.GarbageCollectingConsumer.run(GarbageCollectingConsumer.java:87) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) "in-memory-corruption-25" #95 daemon prio=5 os_prio=31 tid=0x00007fb8d405d800 nid=0xf903 waiting on condition [0x00000001293ef000] java.lang.Thread.State: WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x0000000770d83250> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039) at java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:442) at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)
- Eventually, the log gets filled with the following entries, but the process does not get terminated:
[in-memory-corruption-23] org.modeshape.jcr.cache.RepositoryCache - Repository '9f2434a6-4378-4fe7-a352-aea6c07f56b1' is still not fully initialized after 10 minutes. Aborting.
- If I terminate the Java process that was stuck and run the same test again without cleaning the H2 DB on the filesystem, I can never get to the actual execution of the test, because the members of the cluster necessary for testing cannot be stood up, i.e. two out of five members get started, but then nothing happens.
Summary
The MODE-2670 did make a big difference, but in a clustered environment such a difference ended up having an unexpected outcome, assuming the provided test case is valid. I sincerely hope that I am doing something wrong and I would be thrilled to have a confirmation of that. Otherwise, this problem can result in very much undesired consequences for a production system that relies on ModeShape.
hchiorean, knowing the additional details and problems provided in this JIRA, would it be possible for you to revisit the following decision?
As far as I'm concerned, after I fix
MODE-2670I will focus on something else because these complex test cases simply take up too much of my time. If in the meantime someone else in the community proposes a fix, I'm more than happy to review any PRs.
As far as complexity of the provided test case goes, unless I messed something up in terms of configuration, the presented scenario is fairly common in a clustered environment where multiple consumers work with a single node, e.g. creating files under a given folder. It is true that I had to manually craft the delay between retries for each thread, but that was done to make sure that the problem can be reproduced.
hchiorean and rhauch, as always, your help is greatly appreciated. The turnaround time for various issues is simply amazing, especially for an open-source project. Once again, my apologies for so many questions, however, I hope that in the long run your answers will help the entire community.