-
Enhancement
-
Resolution: Won't Do
-
Major
-
None
-
JDG 7.1.0 GA
-
None
- I have set up two JDG 7.1 servers for cross-site replication. They are connected through the RELAY2 protocol and their caches use the SYNC backup mode. The setup closely follows the documentation: https://access.redhat.com/documentation/en-us/red_hat_jboss_data_grid/7.1/html/administration_and_configuration_guide/set_up_cross_datacenter_replication#configure_cross_datacenter_replication_remote_client_server_mode
- I also have a simple Java application that connects to the Infinispan server over Hot Rod (RemoteCache). I see a deadlock when both sites concurrently attempt to write a record under the same key "123". The following exceptions appear in the server.log of both servers:
20:30:15,461 ERROR [org.infinispan.interceptors.impl.InvocationContextInterceptor] (HotRodServerHandler-8-32) ISPN000136: Error executing command ReplaceCommand, writing keys [[B0x033E03313233]: The local cache sessions failed to backup data to the remote sites:
LON: org.infinispan.util.concurrent.TimeoutException: Timed out after 10 seconds waiting for a response from LON (sync, timeout=10000)
at org.infinispan.xsite.BackupSenderImpl.processFailedResponses(BackupSenderImpl.java:227)
at org.infinispan.xsite.BackupSenderImpl.processResponses(BackupSenderImpl.java:132)
at org.infinispan.xsite.BackupSenderImpl.processResponses(BackupSenderImpl.java:124)
at org.infinispan.interceptors.xsite.NonTransactionalBackupInterceptor.lambda$handleSingleKeyWriteCommand$0(NonTransactionalBackupInterceptor.java:58)
at org.infinispan.interceptors.xsite.NonTransactionalBackupInterceptor$$Lambda$303/1579852903.accept(Unknown Source)
at org.infinispan.interceptors.BaseAsyncInterceptor.invokeNextThenAccept(BaseAsyncInterceptor.java:108)
I am also attaching thread dumps from both servers.
If I am reading the thread dumps correctly, this is what happened:
- Site1 transaction1: cache.replace("123", val)
- Site1 transaction1: lockManager.lock("123", ...) called from AbstractLockingInterceptor. Acquired "site1-lock".
- Site1 transaction1: BackupSender.backupWrite called for "123" and sending backup to Site2
Concurrently, on site2:
- Site2 transaction2: cache.replace("123", val);
- Site2 transaction2: lockManager.lock("123", ...) called from AbstractLockingInterceptor. Acquired "site2-lock".
- Site2 transaction2: BackupSender.backupWrite called for "123" and sending backup to Site1
- In the meantime, Site2 receives the backup from Site1 (triggered by Site1 transaction1). But BaseBackupReceiver on site2 has to wait for the site2-lock, which is held by Site2 transaction2, so it cannot continue. Site1 transaction1 is waiting for the response from that BaseBackupReceiver, so it cannot continue either.
- In the meantime, Site1 receives the backup from Site2 (triggered by Site2 transaction2). But BaseBackupReceiver on site1 has to wait for the site1-lock, which is held by Site1 transaction1, so it cannot continue. Site2 transaction2 is waiting for the response from that BaseBackupReceiver, so it cannot continue either.
So we have a deadlock here, which is only "unblocked" after 10 seconds by the BackupSender timeout.
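
The lock ordering described above can be reproduced without any Infinispan code. The following is only a minimal sketch of the pattern, not the server's implementation: `Site`, `replace` and `runOnce` are hypothetical names, each site is reduced to a single `ReentrantLock` for key "123", the remote BaseBackupReceiver's lock request is made directly from the writer's thread, and the timeout is shortened to 200 ms instead of the 10 seconds seen in the log.

```java
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class CrossSiteDeadlockSketch {

    /** Hypothetical stand-in for one site's lock on key "123". */
    static final class Site {
        final ReentrantLock keyLock = new ReentrantLock();
    }

    /**
     * Models one replace("123", val): lock the key locally, then wait for the
     * remote site to apply the backup, which needs the remote lock for the key.
     * Returns true if the backup was applied, false if it timed out.
     */
    static boolean replace(Site local, Site remote, CyclicBarrier barrier, long timeoutMs)
            throws Exception {
        local.keyLock.lock();                 // local "site-lock" acquired
        try {
            barrier.await();                  // force both writes to overlap
            boolean applied = remote.keyLock.tryLock(timeoutMs, TimeUnit.MILLISECONDS);
            if (applied) {
                remote.keyLock.unlock();
            }
            barrier.await();                  // only here to keep the outcome deterministic
            return applied;
        } finally {
            local.keyLock.unlock();
        }
    }

    /** Runs both writes concurrently; [0] is site1's outcome, [1] is site2's. */
    static boolean[] runOnce(long timeoutMs) {
        Site site1 = new Site();
        Site site2 = new Site();
        CyclicBarrier barrier = new CyclicBarrier(2);
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            Future<Boolean> tx1 = pool.submit(() -> replace(site1, site2, barrier, timeoutMs));
            Future<Boolean> tx2 = pool.submit(() -> replace(site2, site1, barrier, timeoutMs));
            return new boolean[] { tx1.get(), tx2.get() };
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        boolean[] applied = runOnce(200);     // 200 ms instead of 10 s
        System.out.println("site1 backup applied: " + applied[0]);
        System.out.println("site2 backup applied: " + applied[1]);
    }
}
```

In this sketch both writes report `false`: each one holds its own site's lock while waiting for the other site's lock, so neither backup can be applied until the timeout expires.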