JGroups / JGRP-2360

Deadlock while acquiring a distributed lock consecutively by the same thread in a loop


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 4.1.2
    • Affects Version/s: 3.6.18, 4.1.1
    • Component/s: None
    • Steps to Reproduce: described in the readme file in the attached Maven project

    Description

      A deadlock intermittently happens when a distributed lock is acquired consecutively by the same thread, on the same VM, in a loop. Here is a code snippet for which the issue can occur:

      for(String s : list) {
          Lock lock=lock_service.getLock("test_lock_name");
          lock.lock();
          // perform business logic
          lock.unlock();
      }
      

      Running such a loop, I get a deadlock after a few iterations. In the attached logs the program hung after 3 iterations.
      During troubleshooting I found that the lock_id is not incremented for the new distributed lock. In the first two loop iterations everything was fine; at the third iteration the lock_id did not increase:

      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: GRANT_LOCK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: LOCK_GRANTED[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: RELEASE_LOCK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: RELEASE_LOCK_OK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
      
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: GRANT_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: LOCK_GRANTED[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: RELEASE_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: RELEASE_LOCK_OK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
      
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: GRANT_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: CREATE_LOCK[test_lock_name, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: LOCK_GRANTED[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
      

      I added a few extra log statements to the JGroups 4.1.1.Final code and realized that the second client lock was not removed from the client lock table before the creation of the third client lock. The issue lies in the piece of code below. An Owner consists of an address and a thread ID. If the same thread, on the same VM, acquires a distributed lock consecutively and there is an existing entry in the client lock table for that owner, no new lock is created; the old client lock is reused to acquire the new distributed lock:

              protected synchronized ClientLock getLock(String name, Owner owner, boolean create_if_absent) {
                  Map<Owner,ClientLock> owners=table.get(name);
                  if(owners == null) {
                      if(!create_if_absent)
                          return null;
                      owners=Util.createConcurrentMap(20);
                      Map<Owner,ClientLock> existing=table.putIfAbsent(name,owners);
                      if(existing != null)
                          owners=existing;
                  }
                  ClientLock lock=owners.get(owner);
                  if(lock == null) {
                      if(!create_if_absent)
                          return null;
                      lock=createLock(name, owner);
                      owners.put(owner, lock);
                  }
                  return lock;
              }
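
      To make the owner collision concrete: Owner equality is based only on the member address and the thread ID, both of which are identical across loop iterations, so the table lookup always hits the previous entry. A minimal sketch (using org.jgroups.util.Owner and org.jgroups.util.UUID; constructing an Owner standalone like this is for illustration only):

      import org.jgroups.Address;
      import org.jgroups.util.Owner;
      import org.jgroups.util.UUID;

      public class OwnerCollision {
          public static void main(String[] args) {
              Address addr=UUID.randomUUID();            // same member address every iteration
              long tid=Thread.currentThread().getId();   // same thread ID every iteration

              Owner first=new Owner(addr, tid);          // owner used by iteration N
              Owner second=new Owner(addr, tid);         // owner used by iteration N+1

              // equal owners mean owners.get(owner) in getLock() returns the stale
              // ClientLock from iteration N instead of creating a fresh one
              System.out.println(first.equals(second));  // true
          }
      }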
      

      I believe this issue was introduced by the fix for JGRP-2234 and is caused by a race condition. The logic that deletes the client lock from the client lock table is now executed when the client's VM receives the RELEASE_LOCK_OK message from the coordinator. Previously this deletion was executed by the thread in which the unlock() method was called; now it is executed by a separate thread which handles RELEASE_LOCK_OK from the coordinator, and this is why we have a race condition. Here is the sequence which leads to the deadlock:

      1. Create client lock (lock_id=2)
      2. Send GRANT_LOCK (lock_id=2) to coordinator
      3. Receive LOCK_GRANTED (lock_id=2) from coordinator
      4. Send RELEASE_LOCK (lock_id=2) to coordinator
      5. Call lock() in the same thread (no new client lock is created, as there is an existing entry in the client lock table for this owner)
      6. Receive RELEASE_LOCK_OK and delete the client lock from the client lock table
      7. Send GRANT_LOCK (lock_id=2) to coordinator
      8. Receive LOCK_GRANTED (lock_id=2) from coordinator
      9. No entry in the client lock table, so it is not possible to find the thread which needs to be notified
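
      One way the window could be closed is to treat a client lock whose release is already in flight as absent, so a fresh client lock (with a new lock_id) gets created. Below is only a sketch of that idea against the method quoted above, not necessarily the actual fix; the isReleased() state check on ClientLock is hypothetical:

              protected synchronized ClientLock getLock(String name, Owner owner, boolean create_if_absent) {
                  Map<Owner,ClientLock> owners=table.get(name);
                  // ... same owners-map creation as in the original method ...
                  ClientLock lock=owners.get(owner);
                  // hypothetical state check: a client lock whose RELEASE_LOCK has already
                  // been sent must not be reused, otherwise the pending RELEASE_LOCK_OK
                  // deletes the entry and the next LOCK_GRANTED finds nobody to notify
                  if(lock != null && lock.isReleased())
                      lock=null;
                  if(lock == null) {
                      if(!create_if_absent)
                          return null;
                      lock=createLock(name, owner);
                      owners.put(owner, lock);
                  }
                  return lock;
              }

      An alternative would be to remove the entry from the table synchronously in unlock(), as was the case before the JGRP-2234 change.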

      I am attaching a simple program which can be used to reproduce the issue, along with the generated logs.
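
      For reference, the reproducer boils down to a loop like the following (a sketch along the lines of the attached DLTest.java; the configuration file and cluster name are placeholders, and the stack is assumed to contain CENTRAL_LOCK):

      import java.util.concurrent.locks.Lock;
      import org.jgroups.JChannel;
      import org.jgroups.blocks.locking.LockService;

      public class DLTest {
          public static void main(String[] args) throws Exception {
              try(JChannel ch=new JChannel("config.xml")) {    // stack must include CENTRAL_LOCK
                  ch.connect("lock-cluster");
                  LockService lock_service=new LockService(ch);
                  for(int i=0; i < 100; i++) {
                      Lock lock=lock_service.getLock("test_lock_name");
                      lock.lock();   // intermittently blocks forever after a few iterations
                      try {
                          System.out.printf("iteration %d acquired%n", i);
                      }
                      finally {
                          lock.unlock();
                      }
                  }
              }
          }
      }

      As a stopgap, replacing lock() with tryLock(timeout, unit) and retrying may avoid the indefinite hang, but it does not address the underlying race.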

      Attachments

        1. DistributedLockRepoducer.zip
          4 kB
        2. DLTest.java
          0.9 kB
        3. log.log
          112 kB


          People

            Assignee: Bela Ban (rhn-engineering-bban)
            Reporter: Daniel Klosinski (dklosek) (Inactive)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved: