JGroups / JGRP-2613

CENTRAL_LOCK2 protocol intermittently fails to rebuild lock table state with concurrent merge and lock requests


    • Type: Bug
    • Resolution: Won't Do
    • Priority: Major
    • Fix Version/s: 4.2.21
    • Affects Version/s: 4.2.10

      Reproducing this issue is highly dependent on the test environment. The problem does not replicate on my laptop with a local cluster.

      See the attached test application.

      1) Edit the jgroups_config_locking.xml file and adjust initial_hosts for TCPGOSSIP to point to three gossip routers, each on a separate VM.
      2) Use the GossipRouter.sh script to launch the gossip router on each VM.
      3) Run testNodes.sh on each VM.

      The test will spin up 27 nodes on each VM and initiate the locking test. After the 27 nodes are established, an additional 16 nodes will join the cluster at random times and also start a locking test. At most 129 nodes may be active across the three VMs at any one time. The test time for each node will vary, and as each node completes its test it will leave the cluster. A node joining or leaving the cluster causes a View change message to be sent to all the nodes. A View change will initiate a Merge operation with the Coordinator. The Coordinator will use the CENTRAL_LOCK2 protocol and attempt to rebuild the locking state.

      In this test the locks are acquired in order: DWHITE_ISA_ENVELOPE, DWHITE_GS_ENVELOPE-1, DWHITE_ST_ENVELOPE-1. A node cannot acquire another lock in the chain without having first acquired the previous lock ID in the list. When the problem occurs, the locking state is rebuilt incorrectly.

      When the lock state is rebuilt incorrectly, there will be a discrepancy between the server lock table and client lock tables and the nodes will be "deadlocked" with each node waiting on a lock that another node does not own.


      Configuration: a cluster containing three VMs, three gossip routers (one per VM), and approximately 40 JVM nodes per VM. The cluster is using the CENTRAL_LOCK2 protocol with the lock service. The MERGE3 protocol is used to resolve sub-views.

      Each VM:
      MemTotal: 164873940 kB
      cpu family      : 6
      model           : 85
      model name      : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
      stepping        : 0
      microcode       : 0x2006b06
      cpu MHz         : 2294.609
      cache size      : 25344 KB
      16 processors

      With this configuration nodes may leave the cluster after performing work, and new nodes may join and start work. Each node is performing locking operations where the lock IDs must be obtained in order: DWHITE_ISA_ENVELOPE, DWHITE_GS_ENVELOPE-1, DWHITE_ST_ENVELOPE-1. A node cannot acquire another lock in the chain without having first acquired the previous lock ID in the list (i.e. ISA -> GS -> ST).
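
      For reference, the locking loop on each node amounts to something like the sketch below. This is a minimal illustration and not the attached test application: the class name and cluster name are made up, the config file is the one referenced in the reproduction steps, and the locks are taken through the standard JGroups LockService API.

          import java.util.concurrent.locks.Lock;

          import org.jgroups.JChannel;
          import org.jgroups.blocks.locking.LockService;

          public class OrderedLockSketch {
              public static void main(String[] args) throws Exception {
                  JChannel ch = new JChannel("jgroups_config_locking.xml"); // stack with CENTRAL_LOCK2 on top
                  ch.connect("lock-test-cluster");                          // cluster name is illustrative
                  LockService locks = new LockService(ch);

                  // The chain must be acquired in the fixed order ISA -> GS -> ST.
                  Lock isa = locks.getLock("DWHITE_ISA_ENVELOPE");
                  Lock gs  = locks.getLock("DWHITE_GS_ENVELOPE-1");
                  Lock st  = locks.getLock("DWHITE_ST_ENVELOPE-1");

                  isa.lock();
                  try {
                      gs.lock();
                      try {
                          st.lock();
                          try {
                              // ... unit of work performed while holding all three locks ...
                          } finally {
                              st.unlock();
                          }
                      } finally {
                          gs.unlock();
                      }
                  } finally {
                      isa.unlock();
                  }
                  ch.close();
              }
          }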

      Whenever a node leaves or joins the cluster, a View change is produced and the Coordinator must use the MERGE3 and CENTRAL_LOCK2 protocols to resolve the sub-views and rebuild the lock state table. With a View change the Coordinator itself may also move to a different node in the cluster.
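
      To correlate the sequence of installed Views (including MergeViews) with the lock/unlock traffic, each node can log every view it installs. The sketch below is a diagnostic we would add, not part of JGroups or the attached test; ReceiverAdapter, View and MergeView are the standard JGroups classes.

          import org.jgroups.MergeView;
          import org.jgroups.ReceiverAdapter;
          import org.jgroups.View;

          // Install on the channel with: channel.setReceiver(new ViewLogger());
          public class ViewLogger extends ReceiverAdapter {
              @Override
              public void viewAccepted(View view) {
                  if (view instanceof MergeView) {
                      MergeView mv = (MergeView) view;
                      System.out.printf("merge view %s installed, subgroups=%s, coordinator=%s%n",
                                        mv.getViewId(), mv.getSubgroups(), mv.getMembers().get(0));
                  } else {
                      System.out.printf("view %s installed, coordinator=%s%n",
                                        view.getViewId(), view.getMembers().get(0));
                  }
              }
          }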

      Unfortunately, the lock state table is not always re-built correctly by the Coordinator. When the lock state is rebuilt incorrectly, there will be a discrepancy between the server lock table and client lock tables and the nodes will be "deadlocked" with each node waiting on a lock that another node does not own.

      The logs from our QA show the following deadlock and table discrepancies:

      ISA_ENVELOPE_DWHITE:  jazzbox02008-51093 (newnode5), node client table has ISA_ENVELOPE_DWHITE, waiting on GS_ENVELOPE_DWHITE-1 (good)
      GS_ENVELOPE_DWHITE-1: jazzbox02008-24988 (newnode9), node client table has nothing, waiting on ISA_ENVELOPE_DWHITE (bad)
      ST_ENVELOPE_DWHITE-1: jazzbox02008-24988 (newnode9), node client table has nothing, waiting on ISA_ENVELOPE_DWHITE (bad)

      Checking the respective client lock tables of the other nodes in the cluster reveals more discrepancies with the server lock table:
      client locks show jazzbox02008-57169 (newnode7) has GS_ENVELOPE_DWHITE-1
      client locks show no ownership for ST_ENVELOPE_DWHITE-1
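
      One way to collect this kind of comparison is to have every node dump its lock tables and diff them against the Coordinator's server table. A minimal sketch of such a dump helper (the class is ours; it assumes LockService.printLocks(), which prints the tables kept by the locking protocol underneath):

          import org.jgroups.JChannel;
          import org.jgroups.blocks.locking.LockService;

          public class LockTableDump {
              // Returns this node's view of the lock tables, prefixed with its address,
              // so the server table on the Coordinator can be compared against the
              // client tables on the other members.
              public static String dump(JChannel ch, LockService locks) {
                  return ch.getAddress() + " lock tables:\n" + locks.printLocks();
              }
          }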

      The following warnings are noted in the jgroups logs. For example:
      jazzbox02008-51093: lock ISA_ENVELOPE_DWHITE requested by jazzbox02008-24988::104 is already present

      jazzbox02008-57169: timeout waiting for RELEASE_LOCK_OK response for lock GS_ENVELOPE_DWHITE-1 (id=6935, locked=true, owner=jazzbox02008-57169::135)

      JGRP000011: jazzbox02008-4353: dropped message batch from non-member jazzbox02007-26179 (view=[jazzbox02008-51093|137] (6) [jazzbox02008-51093, jazzbox02007-34942, jazzbox02008-24988, jazzbox02008-4353, jazzbox02008-44059, jazzbox02008-29636])

      jazzbox02008-60504: failed sending message to <ip_address>: java.nio.channels.ClosedSelectorException

      jazzbox02007-26179: terminating after 10 unsuccessful LEAVE attempts (waited 20002 ms):

      I hope we can determine why the lock table is sometimes not rebuilt properly. My guess is that the ordering of the View change messages, coupled with the timestamps of the lock/unlock requests, is a key factor in resolving the problem.

      Another feature that would be very helpful is the capability to specify a timeout for a lock. The JGroups Coordinator would be responsible for determining when a lock times out and would revoke it.
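
      For comparison, the locks handed out by LockService already support a bounded wait on acquisition through the standard java.util.concurrent.locks.Lock.tryLock(timeout, unit); what is being requested here is different, namely a timeout on a lock that is already held, enforced and revoked by the Coordinator. A sketch of the existing acquisition-side bound (the helper class and the 30-second value are illustrative):

          import java.util.concurrent.TimeUnit;
          import java.util.concurrent.locks.Lock;

          import org.jgroups.blocks.locking.LockService;

          public class BoundedAcquire {
              // Waits at most 30 seconds for the first lock in the chain instead of
              // blocking forever. Note: this does NOT revoke a lock that is already held.
              public static boolean runWithIsa(LockService locks, Runnable work) throws InterruptedException {
                  Lock isa = locks.getLock("DWHITE_ISA_ENVELOPE");
                  if (!isa.tryLock(30, TimeUnit.SECONDS))
                      return false;
                  try {
                      work.run();
                      return true;
                  } finally {
                      isa.unlock();
                  }
              }
          }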

      See the attached log files for a captured case where the lock table is not rebuilt correctly.

              Assignee: Bela Ban (rhn-engineering-bban)
              Reporter: David White (dwhitejazz) (Inactive)
              Votes: 0
              Watchers: 2
