Uploaded image for project: 'ModeShape'
  1. ModeShape
  2. MODE-2665

Database deadlocks can occur when lock() and unlock() are called under certain conditions

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Major Major
    • 5.4.0.Final
    • 5.2.0.Final
    • JCR
    • None
    • Hide

      I used 15 threads running the lock()... unlock() code posted in the description against 6 instances of the JCR repository on an Oracle database. It is difficult to reproduce, and requires the following conditions to be met:

      • two threads to race to call RepositoryLockManager.lock(...)
      • one thread succeeds, performs work, and attempts to call RepositoryLockManager.unlock(...)
      • the second thread then attempts to retry to acquire the lock by calling RepositoryLockManager.lock(...)
      Show
      I used 15 threads running the lock()... unlock() code posted in the description against 6 instances of the JCR repository on an Oracle database. It is difficult to reproduce, and requires the following conditions to be met: two threads to race to call RepositoryLockManager.lock(...) one thread succeeds, performs work, and attempts to call RepositoryLockManager.unlock(...) the second thread then attempts to retry to acquire the lock by calling RepositoryLockManager.lock(...)

      I have multiple threads running the code below, calling into a cluster of the ModeShape repositories on an Oracle database:

      session.getWorkspace().getLockManager().lock(parent.getPath(), ...);
      try {
        parent.addNode("child-" + i);
      } finally {
        session.getWorkspace().getLockManager().unlock(parent.getPath());
      }
      

      I noticed the database deadlock when I was investigating hung threads. After going through the logs from the WritableSessionCache, it looks like the order of the database locks is determined by RepositoryLockManager.lock(...) and RepositoryLockManager.unlock(...)), and are contributing to the database deadlock.

      Here's the relevant output from WritableSessionCache's logging that demonstrates the deadlock:

      2017-01-31 18:26:09,438-0600 [add-node-parallel-10] DEBUG org.modeshape.jcr.cache.document.WritableSessionCache - Locked the nodes: [e60d4aa317f1e7mode:lock-e60d4aa7505d64e7751d4d-ff9a-45d3-8453-c51e87e4df36, e60d4aa317f1e7mode:locks]
      ...
      2017-01-31 18:26:09,439-0600 [add-node-parallel-6] DEBUG org.modeshape.jcr.cache.document.WritableSessionCache - Locked the nodes: [e60d4aa7505d64e7751d4d-ff9a-45d3-8453-c51e87e4df36]
      ...
      2017-01-31 18:26:09,448-0600 [add-node-parallel-10] DEBUG org.modeshape.jcr.cache.document.WritableSessionCache - Attempting to the lock nodes: [e60d4aa7505d64e7751d4d-ff9a-45d3-8453-c51e87e4df36]
      ...
      2017-01-31 18:26:09,449-0600 [add-node-parallel-6] DEBUG org.modeshape.jcr.cache.document.WritableSessionCache - Attempting to the lock nodes: [e60d4aa317f1e7mode:locks, e60d4aa317f1e7mode:lock-e60d4aa7505d64e7751d4d-ff9a-45d3-8453-c51e87e4df36]
      ...

      From looking through the source code in RepositoryLockManager, the lock(...) method saves the lock in the system session first, then saves the node being lock.

      While unlock(...) saves the node being locked first, then deletes the lock itself.

      This means that when lock(...) and unlock(...) are called at around the same time, lock(...) would acquire a db lock for the JCR lock itself, and waits for the db lock to be released for the JCR node being locked.
      unlock(...) would hold a db lock for the JCR node being locked and waits to acquire a db lock for the JCR lock itself.

      Proposed solution:
      I believe that changing the save order in RepositoryLockManager's unlock(...) would resolve the deadlock. I forked ModeShape, made the change and wasn't seeing any more deadlocks or hung threads. I don't know enough context around why the save order is the way it is, so I am not sure if this is the best/correct solution.

            hchiorean Horia Chiorean (Inactive)
            regimus Reggie Y (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: