JGroups / JGRP-2520

CENTRAL_LOCK2: locks not released on kill


    • Type: Bug
    • Priority: Major
    • Resolution: Can't Do
    • Affects Version/s: 4.2.12, 5.1.6

      2 emails from D. White:

      When a node thread is killed, the JChannel/LockService remains active because the node JVM is not killed. A new worker thread is created to replace the thread that was killed. In this case, the cluster view has not changed, and therefore the locks remain.
      When the node JVM process is killed, that action triggers a cluster view change which is received by the Coordinator. In this case, the server lock state is rebuilt and the locks are released.
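
      For illustration, a minimal sketch of how a coordinator-side component can observe these view changes via the JGroups Receiver API (the class and cluster names below are made up; this assumes JGroups 5.x, where Receiver's methods have default implementations):

      import org.jgroups.JChannel;
      import org.jgroups.Receiver;
      import org.jgroups.View;

      // Hypothetical adapter: a JVM kill produces a new view, which is what lets the
      // CENTRAL_LOCK2 coordinator rebuild its server lock table. Killing only a worker
      // thread produces no view change, so no rebuild happens.
      public class ViewChangeLogger implements Receiver {

          @Override
          public void viewAccepted(View view) {
              // The coordinator sees leavers here; CENTRAL_LOCK2 reacts to the same event internally.
              System.out.println("New view: " + view);
          }

          public static void main(String[] args) throws Exception {
              JChannel ch = new JChannel();          // default stack; a real one would include CENTRAL_LOCK2
              ch.setReceiver(new ViewChangeLogger());
              ch.connect("lock-demo");               // example cluster name
              Thread.sleep(60_000);                  // keep the member alive long enough to observe views
              ch.close();
          }
      }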

      I think the following will help:

      Setup: a 3-node cluster, each node with two worker threads. Each set of worker threads has access to the parent node's JChannel.
      node1 (Coordinator)
      node2
      node3

      node2 thread1 acquires lock on resource ENV:ISA_IEA:1
      node2 thread1 acquires lock on resource ENV:GS_GE:1
      node2 thread2 requests lock on resource ENV:ISA_IEA:1

      node1 thread1 requests lock on resource ENV:ISA_IEA:1
      node1 thread2 requests lock on resource ENV:ISA_IEA:1
      node3 thread1 requests lock on resource ENV:ISA_IEA:1
      node3 thread2 requests lock on resource ENV:ISA_IEA:1

      Scenario #1:
      node2 thread1 runs too long, does not respond to a soft shutdown, and the node2 JVM process is killed by the watchdog service.
      [SPEChannelAdapter] viewAccepted received by Coordinator node1.
      Both locks are released.

      Scenario #2:
      node2 thread1 runs too long, and the soft shutdown kills thread1, leaving the server locks in place and the node2 JVM process running.
      The watchdog detects that the locks on ENV:ISA_IEA:1 and ENV:GS_GE:1 have been held too long and issues RELEASE_LOCK messages from the Coordinator with the proper Owner.
      ENV:GS_GE:1 is released.
      ENV:ISA_IEA:1 remains locked, seemingly due to the presence of a GRANT_LOCK request from node2 thread2.

      Scenario #3 (slight variation on #2):
      node2 thread1 runs too long, and the soft shutdown kills thread1, leaving the server locks in place and the node2 JVM process running.
      The watchdog detects that the locks on ENV:ISA_IEA:1 and ENV:GS_GE:1 have been held too long and issues RELEASE_LOCK requests from the Coordinator with the proper Owner.
      The watchdog also removes the GRANT_LOCK request for ENV:ISA_IEA:1 from node2 thread2.
      Now both locks are released.
      The presence of GRANT_LOCK requests from node1 and node3 does not prevent the release of the lock for ENV:ISA_IEA:1 held by node2.

      Email 2:

      Yes, we acquire a lock within a try/catch block and release with finally.
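
      For reference, the pattern looks like this (a minimal sketch using the standard JGroups LockService API; the cluster and lock names are examples):

      import java.util.concurrent.locks.Lock;

      import org.jgroups.JChannel;
      import org.jgroups.blocks.locking.LockService;

      public class LockUsageSketch {
          public static void main(String[] args) throws Exception {
              JChannel ch = new JChannel();              // the real stack must contain CENTRAL_LOCK2
              ch.connect("lock-demo");                   // example cluster name

              LockService lockService = new LockService(ch);
              Lock lock = lockService.getLock("ENV:ISA_IEA:1");

              lock.lock();                               // blocks until the coordinator grants the lock
              try {
                  // ... do work while holding the cluster-wide lock ...
              } finally {
                  lock.unlock();                         // never executes if the JVM is killed -> orphan lock
              }

              ch.close();
          }
      }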

      In production, each JVM has two worker threads. If any of the threads runs too long, a monitor task force-kills the JVM process. If there are acquired locks, they do not get released by the unlock call in the finally block. Usually a JVM is killed because a bad customer map runs too long and the other thread with acquired locks becomes "collateral damage". Not every business scenario uses locks, so the "orphan lock" scenario doesn't happen every time a JVM process is killed. Also, both threads are not always active.

      We use the CENTRAL_LOCK2 protocol. For some reason, locks acquired by the killed process may remain in the server lock table. On occasion, the existing Coordinator doesn't detect and revoke the "orphan" locks.

      Does a view change where the Coordinator has not changed cause that Coordinator to rebuild the lock state? In a view change where the Coordinator does change, that seems to fix the problem because the new Coordinator rebuilds the lock state table.
      In the case where a new Coordinator is assigned, do the state transfer protocols need to be in the configuration (e.g. BARRIER, pbcast.STATE_TRANSFER) in order for the new Coordinator to correctly re-establish the lock state? I don't think so because CENTRAL_LOCK2 does not use state-transfer; the Coordinator rebuilds the lock state.
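
      For what it's worth, the only configuration aspect involved is that CENTRAL_LOCK2 sits at the top of the stack; no BARRIER or pbcast.STATE_TRANSFER is needed for the lock state. A minimal programmatic sketch (this stack is deliberately stripped down for illustration; a real deployment would typically start from the stock udp.xml/tcp.xml and append <CENTRAL_LOCK2/> as the top protocol):

      import org.jgroups.JChannel;
      import org.jgroups.protocols.CENTRAL_LOCK2;
      import org.jgroups.protocols.PING;
      import org.jgroups.protocols.UDP;
      import org.jgroups.protocols.UNICAST3;
      import org.jgroups.protocols.pbcast.GMS;
      import org.jgroups.protocols.pbcast.NAKACK2;
      import org.jgroups.protocols.pbcast.STABLE;

      public class CentralLock2StackSketch {
          public static void main(String[] args) throws Exception {
              JChannel ch = new JChannel(
                      new UDP(),             // transport
                      new PING(),            // discovery
                      new NAKACK2(),         // reliable multicast
                      new UNICAST3(),        // reliable unicast
                      new STABLE(),          // message garbage collection
                      new GMS(),             // membership / view installation
                      new CENTRAL_LOCK2());  // lock protocol at the top; no STATE_TRANSFER/BARRIER needed
              ch.connect("lock-demo");
              // ... LockService usage as in the earlier sketch ...
              ch.close();
          }
      }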

      To alleviate this problem, we have a lock monitor thread which runs on the Coordinator node and keeps track of how long each lock has been held. Since no flow can run for more than an hour, any lock held longer than that is definitely an orphan. The lock monitor task issues RELEASE_LOCK requests using the owner address of the orphan lock. The RELEASE_LOCK message works in all cases except where there are pending GRANT_LOCK requests in the queue from the same owner address as the held lock. If the GRANT_LOCK requests are from other addresses, the RELEASE_LOCK request works.
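
      The bookkeeping side of such a monitor is straightforward; only the revocation step has to reach into the protocol. A hedged sketch (class names, threshold, and schedule are hypothetical, and the actual RELEASE_LOCK issuance is left as a placeholder):

      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;
      import java.util.concurrent.Executors;
      import java.util.concurrent.ScheduledExecutorService;
      import java.util.concurrent.TimeUnit;

      public class LockMonitorSketch {

          /** What is remembered for every successfully acquired lock. */
          record HeldLock(String lockId, String owner, long acquiredMillis) {}

          private static final long MAX_HOLD_MILLIS = TimeUnit.HOURS.toMillis(1); // no flow runs longer than an hour

          private final Map<String, HeldLock> heldLocks = new ConcurrentHashMap<>();
          private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

          /** Called whenever a lock is granted. */
          public void onLockAcquired(String lockId, String owner) {
              heldLocks.put(lockId, new HeldLock(lockId, owner, System.currentTimeMillis()));
          }

          /** Called on a normal unlock. */
          public void onLockReleased(String lockId) {
              heldLocks.remove(lockId);
          }

          /** Runs on the Coordinator node and flags locks held longer than the threshold. */
          public void start() {
              scheduler.scheduleAtFixedRate(() -> {
                  long now = System.currentTimeMillis();
                  heldLocks.values().stream()
                           .filter(l -> now - l.acquiredMillis() > MAX_HOLD_MILLIS)
                           .forEach(l -> {
                               // Orphan detected: this is where the real monitor issues a RELEASE_LOCK
                               // request to the lock protocol on behalf of l.owner().
                               System.out.println("Orphan lock detected: " + l);
                           });
              }, 1, 1, TimeUnit.MINUTES);
          }
      }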

      To simulate the problem, a test application deliberately skips the unlock operation in the finally block, creating the "orphan" in the server lock table. Other instances of the test application run with normal lock/unlock operations. The lock monitor thread on the Coordinator subsequently detects the "lock held too long" orphan condition and issues the RELEASE_LOCK request on behalf of the orphan lock owner. Whenever a lock is successfully acquired, the lock monitor task internally keeps track of the acquired timestamp, owner, and lock ID.

      I'd love to get rid of the complex lock monitor and ensure lock revoke operations are initiated by the Coordinator via the CENTRAL_LOCK2 protocol.

      Another enhancement that would completely solve this problem: Allow a timeout to be specified for holding a lock. The JGroups protocol would then revoke the lock if the timeout threshold were reached.

              Assignee: rhn-engineering-bban Bela Ban
              Reporter: rhn-engineering-bban Bela Ban
              Votes: 0
              Watchers: 1
