JGroups / JGRP-2360

Deadlock while acquiring a distributed lock consecutively by the same thread in a loop


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 4.1.2
    • Affects Version/s: 3.6.18, 4.1.1
    • Component/s: None
    • Steps to Reproduce: described in the readme file in the attached Maven project

    Description

      A deadlock intermittently happens when a distributed lock is acquired consecutively by the same thread, on the same VM, in a loop. Here is a code snippet for which the issue can occur:

      for(String s : list) {
          Lock lock=lock_service.getLock("test_lock_name");
          lock.lock();
          // perform business logic
          lock.unlock();
      }
      

      Running such a loop, I get a deadlock after a few iterations. In the attached logs the program hung after 3 iterations.
      During troubleshooting I found that the lock_id is not incremented for the new distributed lock. In the first two loop iterations everything was fine; at the third iteration the lock_id did not increase:

      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: GRANT_LOCK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: LOCK_GRANTED[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: RELEASE_LOCK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: RELEASE_LOCK_OK[test_lock_name, lock_id=1, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
      
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: GRANT_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: LOCK_GRANTED[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: RELEASE_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: RELEASE_LOCK_OK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
      
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 --> svc-1-sps-4688: GRANT_LOCK[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: CREATE_LOCK[test_lock_name, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
      2019-07-15 16:03:32 TRACE CENTRAL_LOCK:163 - svc-2-sps-34594 <-- svc-1-sps-4688: LOCK_GRANTED[test_lock_name, lock_id=2, owner=svc-2-sps-34594::1, sender=svc-1-sps-4688]
      

      I added a few extra log statements to the JGroups 4.1.1.Final code and realized that the second client lock was not removed from the client lock table before the creation of the third client lock. The issue lies in the piece of code below. An Owner consists of an address and a thread ID. If the same thread, on the same VM, acquires a distributed lock consecutively and there is an existing entry in the client lock table for that owner, no new lock is created; the old client lock is reused to acquire the new distributed lock:

              protected synchronized ClientLock getLock(String name, Owner owner, boolean create_if_absent) {
                  Map<Owner,ClientLock> owners=table.get(name);
                  if(owners == null) {
                      if(!create_if_absent)
                          return null;
                      owners=Util.createConcurrentMap(20);
                      Map<Owner,ClientLock> existing=table.putIfAbsent(name,owners);
                      if(existing != null)
                          owners=existing;
                  }
                  ClientLock lock=owners.get(owner);
                  if(lock == null) {
                      if(!create_if_absent)
                          return null;
                      lock=createLock(name, owner);
                      owners.put(owner, lock);
                  }
                  return lock;
              }
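
      To make the owner collision concrete: Owner equality is based only on the member address and the thread ID, both of which are identical across loop iterations, so the table lookup always hits the previous entry. A minimal sketch (using org.jgroups.util.Owner and org.jgroups.util.UUID; constructing an Owner standalone like this is for illustration only):

      import org.jgroups.Address;
      import org.jgroups.util.Owner;
      import org.jgroups.util.UUID;

      public class OwnerCollision {
          public static void main(String[] args) {
              Address addr=UUID.randomUUID();            // same member address every iteration
              long tid=Thread.currentThread().getId();   // same thread ID every iteration

              Owner first=new Owner(addr, tid);          // owner used by iteration N
              Owner second=new Owner(addr, tid);         // owner used by iteration N+1

              // equal owners mean owners.get(owner) in getLock() returns the stale
              // ClientLock from iteration N instead of creating a fresh one
              System.out.println(first.equals(second));  // true
          }
      }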
      

      I believe this issue was introduced by the fix for JGRP-2234 and is caused by a race condition. The logic that deletes the client lock from the client lock table is now executed when the client's VM receives the RELEASE_LOCK_OK message from the coordinator. Previously this deletion was executed by the thread in which the unlock() method was called; now it is executed by a separate thread which handles RELEASE_LOCK_OK from the coordinator, and this is why we have a race condition. Here is the sequence which leads to the deadlock:

      1. Create client lock (lock_id=2)
      2. Send GRANT_LOCK (lock_id=2) to coordinator
      3. Receive LOCK_GRANTED (lock_id=2) from coordinator
      4. Send RELEASE_LOCK (lock_id=2) to coordinator
      5. Call lock() in the same thread (no new client lock is created, as there is an existing entry in the client lock table for this owner)
      6. Receive RELEASE_LOCK_OK and delete the client lock from the client lock table
      7. Send GRANT_LOCK (lock_id=2) to coordinator
      8. Receive LOCK_GRANTED (lock_id=2) from coordinator
      9. No entry in the client lock table, so it is not possible to find the thread which needs to be notified
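
      One way the window could be closed is to treat a client lock whose release is already in flight as absent, so a fresh client lock (with a new lock_id) gets created. Below is only a sketch of that idea against the method quoted above, not necessarily the actual fix; the isReleased() state check on ClientLock is hypothetical:

              protected synchronized ClientLock getLock(String name, Owner owner, boolean create_if_absent) {
                  Map<Owner,ClientLock> owners=table.get(name);
                  // ... same owners-map creation as in the original method ...
                  ClientLock lock=owners.get(owner);
                  // hypothetical state check: a client lock whose RELEASE_LOCK has already
                  // been sent must not be reused, otherwise the pending RELEASE_LOCK_OK
                  // deletes the entry and the next LOCK_GRANTED finds nobody to notify
                  if(lock != null && lock.isReleased())
                      lock=null;
                  if(lock == null) {
                      if(!create_if_absent)
                          return null;
                      lock=createLock(name, owner);
                      owners.put(owner, lock);
                  }
                  return lock;
              }

      An alternative would be to remove the entry from the table synchronously in unlock(), as was the case before the JGRP-2234 change.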

      I am attaching a simple program which can be used to reproduce the issue, along with the generated logs.
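
      For reference, the reproducer boils down to a loop like the following (a sketch along the lines of the attached DLTest.java; the configuration file and cluster name are placeholders, and the stack is assumed to contain CENTRAL_LOCK):

      import java.util.concurrent.locks.Lock;
      import org.jgroups.JChannel;
      import org.jgroups.blocks.locking.LockService;

      public class DLTest {
          public static void main(String[] args) throws Exception {
              try(JChannel ch=new JChannel("config.xml")) {    // stack must include CENTRAL_LOCK
                  ch.connect("lock-cluster");
                  LockService lock_service=new LockService(ch);
                  for(int i=0; i < 100; i++) {
                      Lock lock=lock_service.getLock("test_lock_name");
                      lock.lock();   // intermittently blocks forever after a few iterations
                      try {
                          System.out.printf("iteration %d acquired%n", i);
                      }
                      finally {
                          lock.unlock();
                      }
                  }
              }
          }
      }

      As a stopgap, replacing lock() with tryLock(timeout, unit) and retrying may avoid the indefinite hang, but it does not address the underlying race.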

      Attachments

        1. DistributedLockRepoducer.zip
          4 kB
        2. DLTest.java
          0.9 kB
        3. log.log
          112 kB


          People

            Assignee: Bela Ban (rhn-engineering-bban)
            Reporter: Daniel Klosinski (dklosek) (Inactive)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:
              Resolved: