Uploaded image for project: 'Infinispan'
  1. Infinispan
  2. ISPN-15181

Deadlock when creating caches with a zero-capacity node and a single stateful node

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Major Major
    • 14.0.18.Final, 15.0.0.Final
    • 14.0.17.Final, 15.0.0.Final
    • Core
    • None

      Deadlock is periodically encountered in the Operator testsuite. The workflow is as follows:

      1. Infinispan cluster with 1 pod created; test-backup-restore-no-auth-target-0
      2. Zero-capacity pod joins the cluster; restore
      3. Backup restore operation initiated on restore pod, but the restore operation ultimately times out with:

      09:58:08,434 ERROR (non-blocking-thread--p2-t18) [org.infinispan.server.core.backup.BackupManagerImpl] ISPN005040: Cannot restore cluster backup '/opt/infinispan/backups/backup/backup.zip': java.util.concurrent.CompletionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 27 from test-backup-restore-no-auth-target-0-3137 after 15 seconds
      	at org.infinispan.util.concurrent.CompletionStages.join(CompletionStages.java:87)
      	at org.infinispan.server.core.backup.BackupReader.lambda$restore$0(BackupReader.java:73)
      	at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804)
      	at org.jboss.threads.ContextClassLoaderSavingRunnable.run(ContextClassLoaderSavingRunnable.java:35)
      	at org.jboss.threads.EnhancedQueueExecutor.safeRun(EnhancedQueueExecutor.java:1982)
      	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.doRunTask(EnhancedQueueExecutor.java:1486)
      	at org.jboss.threads.EnhancedQueueExecutor$ThreadBody.run(EnhancedQueueExecutor.java:1377)
      	at java.base/java.lang.Thread.run(Thread.java:833)
      Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 27 from test-backup-restore-no-auth-target-0-3137 after 15 seconds
      	at org.infinispan.remoting.transport.impl.SingleTargetRequest.onTimeout(SingleTargetRequest.java:86)
      	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:88)
      	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
      	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
      	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
      	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
      	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
      	... 1 more
      

      Attached logs for 14.0.17.Final server with some additional trace logging.

      Failure State

      The deadlock seems to occur when the NonTransactionalLockingInterceptor detects that the test-backup-restore-no-auth-target-0 pod is the primary owner for the org.infinispan.CONFIG cache. The transactional interceptors are skipped and the exampleCache is never created on test-backup-restore-no-auth-target-0. Consequently, the getOrCreate operation on the zero-capacity timesout and the restore fails.

      test-backup-restore-no-auth-target-0.log

      09:57:53,427 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler] Attempting to execute CacheRpcCommand: SingleRpcCommand{cacheName='org.infinispan.CONFIG', command=PutKeyValueCommand{key=ScopedState{scope='cache', name='someCache'}, value=org.infinispan.globalstate.impl.CacheState@1fea8126, flags=[], commandInvocationId=CommandInvocation:restore-17606:21, putIfAbsent=true, returnEntry=false, valueMatcher=MATCH_EXPECTED, metadata=EmbeddedExpirableMetadata{version=null, lifespan=-1, maxIdle=-1}, internalMetadata=null, successful=true, topologyId=5}} [sender=restore-17606]
      09:57:53,428 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.remoting.inboundhandler.BasePerCacheInboundInvocationHandler] Calling perform() on SingleRpcCommand{cacheName='org.infinispan.CONFIG', command=PutKeyValueCommand{key=ScopedState{scope='cache', name='someCache'}, value=org.infinispan.globalstate.impl.CacheState@1fea8126, flags=[], commandInvocationId=CommandInvocation:restore-17606:21, putIfAbsent=true, returnEntry=false, valueMatcher=MATCH_EXPECTED, metadata=EmbeddedExpirableMetadata{version=null, lifespan=-1, maxIdle=-1}, internalMetadata=null, successful=true, topologyId=5}}
      09:57:53,428 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.interceptors.impl.InvocationContextInterceptor] Invoked org.infinispan.CONFIG with command PutKeyValueCommand{key=ScopedState{scope='cache', name='someCache'}, value=org.infinispan.globalstate.impl.CacheState@1fea8126, flags=[], commandInvocationId=CommandInvocation:restore-17606:21, putIfAbsent=true, returnEntry=false, valueMatcher=MATCH_EXPECTED, metadata=EmbeddedExpirableMetadata{version=null, lifespan=-1, maxIdle=-1}, internalMetadata=null, successful=true, topologyId=5} and InvocationContext [SingleKeyNonTxInvocationContext{isLocked=false, key=null, cacheEntry=null, origin=restore-17606, lockOwner=CommandInvocation:restore-17606:21}]
      09:57:53,428 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.statetransfer.StateTransferInterceptor] handleNonTxWriteCommand for command PutKeyValueCommand{key=ScopedState{scope='cache', name='someCache'}, value=org.infinispan.globalstate.impl.CacheState@1fea8126, flags=[], commandInvocationId=CommandInvocation:restore-17606:21, putIfAbsent=true, returnEntry=false, valueMatcher=MATCH_EXPECTED, metadata=EmbeddedExpirableMetadata{version=null, lifespan=-1, maxIdle=-1}, internalMetadata=null, successful=true, topologyId=5}, topology id 5
      09:57:53,428 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor] cdl=org.infinispan.interceptors.locking.ClusteringDependentLogic$ReplicationLogic
      09:57:53,428 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor] topology=CacheTopology{id=5, phase=NO_REBALANCE, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (2)[test-backup-restore-no-auth-target-0-3137: 125+131, restore-17606: 131+125]}, pendingCH=null, unionCH=null, actualMembers=[test-backup-restore-no-auth-target-0-3137, restore-17606], persistentUUIDs=[4b6dc311-d00a-4eed-9432-0f2f78ae19cd, 615667c1-e88e-427c-a130-07174d192821]}
      09:57:53,428 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor] di=DistributionInfo{primary, segmentId=57, readOwners=[test-backup-restore-no-auth-target-0-3137, restore-17606], writeOwners=[test-backup-restore-no-auth-target-0-3137, restore-17606]}
      09:57:53,428 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor] primary=true
      09:57:53,428 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor] Are (test-backup-restore-no-auth-target-0-3137) we the lock owners for segment '57'? true
      09:57:53,428 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.interceptors.impl.CallInterceptor] Invoking: PutKeyValueCommand
      09:57:53,428 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.remoting.transport.jgroups.JGroupsTransport] test-backup-restore-no-auth-target-0-3137 sending broadcast request 9 to [test-backup-restore-no-auth-target-0-3137, restore-17606]: SingleRpcCommand{cacheName='org.infinispan.CONFIG', command=PutKeyValueCommand{key=ScopedState{scope='cache', name='someCache'}, value=org.infinispan.globalstate.impl.CacheState@1fea8126, flags=[BACKUP_WRITE], commandInvocationId=CommandInvocation:restore-17606:21, putIfAbsent=true, returnEntry=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedExpirableMetadata{version=null, lifespan=-1, maxIdle=-1}, internalMetadata=null, successful=true, topologyId=5}}
      09:57:53,460 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.remoting.transport.jgroups.JGroupsTransport] test-backup-restore-no-auth-target-0-3137 received request 28 from restore-17606: TopologyJoinCommand{cacheName='someCache', origin=null, joinInfo=CacheJoinInfo{consistentHashFactory=org.infinispan.distribution.ch.impl.SyncConsistentHashFactory@ffffd8e9, numSegments=256, numOwners=2, timeout=240000, cacheMode=DIST_SYNC, persistentUUID=615667c1-e88e-427c-a130-07174d192821, persistentStateChecksum=Optional.empty}, viewId=1}
      09:57:53,460 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler] Attempting to execute non-CacheRpcCommand: TopologyJoinCommand{cacheName='someCache', origin=restore-17606, joinInfo=CacheJoinInfo{consistentHashFactory=org.infinispan.distribution.ch.impl.SyncConsistentHashFactory@ffffd8e9, numSegments=256, numOwners=2, timeout=240000, cacheMode=DIST_SYNC, persistentUUID=615667c1-e88e-427c-a130-07174d192821, persistentStateChecksum=Optional.empty}, viewId=1} [sender=restore-17606]
      09:57:53,461 TRACE (jgroups-6,test-backup-restore-no-auth-target-0-3137) [org.infinispan.topology.ClusterCacheStatus] Cache someCache initialized. Persisted state? false
      

      Succesful State

      When the test-backup-restore-no-auth-target-0 pod is NOT identified as the primary, the getOrCreate operation proceeds as expected:

      test-backup-restore-no-auth-target-0-fail.log

      10:04:09,399 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.remoting.inboundhandler.GlobalInboundInvocationHandler] Attempting to execute CacheRpcCommand: SingleRpcCommand{cacheName='org.infinispan.CONFIG', command=PutKeyValueCommand{key=ScopedState{scope='cache', name='someCache'}, value=org.infinispan.globalstate.impl.CacheState@675fa237, flags=[BACKUP_WRITE], commandInvocationId=CommandInvocation:restore-37521:21, putIfAbsent=true, returnEntry=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedExpirableMetadata{version=null, lifespan=-1, maxIdle=-1}, internalMetadata=null, successful=true, topologyId=5}} [sender=restore-37521]
      10:04:09,399 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.remoting.inboundhandler.BasePerCacheInboundInvocationHandler] Calling perform() on SingleRpcCommand{cacheName='org.infinispan.CONFIG', command=PutKeyValueCommand{key=ScopedState{scope='cache', name='someCache'}, value=org.infinispan.globalstate.impl.CacheState@675fa237, flags=[BACKUP_WRITE], commandInvocationId=CommandInvocation:restore-37521:21, putIfAbsent=true, returnEntry=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedExpirableMetadata{version=null, lifespan=-1, maxIdle=-1}, internalMetadata=null, successful=true, topologyId=5}}
      10:04:09,399 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.interceptors.impl.InvocationContextInterceptor] Invoked org.infinispan.CONFIG with command PutKeyValueCommand{key=ScopedState{scope='cache', name='someCache'}, value=org.infinispan.globalstate.impl.CacheState@675fa237, flags=[BACKUP_WRITE], commandInvocationId=CommandInvocation:restore-37521:21, putIfAbsent=true, returnEntry=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedExpirableMetadata{version=null, lifespan=-1, maxIdle=-1}, internalMetadata=null, successful=true, topologyId=5} and InvocationContext [SingleKeyNonTxInvocationContext{isLocked=false, key=null, cacheEntry=null, origin=restore-37521, lockOwner=CommandInvocation:restore-37521:21}]
      10:04:09,399 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.statetransfer.StateTransferInterceptor] handleNonTxWriteCommand for command PutKeyValueCommand{key=ScopedState{scope='cache', name='someCache'}, value=org.infinispan.globalstate.impl.CacheState@675fa237, flags=[BACKUP_WRITE], commandInvocationId=CommandInvocation:restore-37521:21, putIfAbsent=true, returnEntry=false, valueMatcher=MATCH_ALWAYS, metadata=EmbeddedExpirableMetadata{version=null, lifespan=-1, maxIdle=-1}, internalMetadata=null, successful=true, topologyId=5}, topology id 5
      10:04:09,399 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor] cdl=org.infinispan.interceptors.locking.ClusteringDependentLogic$ReplicationLogic
      10:04:09,399 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor] topology=CacheTopology{id=5, phase=NO_REBALANCE, rebalanceId=2, currentCH=ReplicatedConsistentHash{ns = 256, owners = (2)[test-backup-restore-no-auth-target-0-3985: 125+131, restore-37521: 131+125]}, pendingCH=null, unionCH=null, actualMembers=[test-backup-restore-no-auth-target-0-3985, restore-37521], persistentUUIDs=[6315c357-f04f-4505-9385-7330dca4b0b7, 1b728e4f-35f8-4219-b2ea-e28c766133c0]}
      10:04:09,399 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor] di=DistributionInfo{read+write, segmentId=57, readOwners=[restore-37521, test-backup-restore-no-auth-target-0-3985], writeOwners=[restore-37521, test-backup-restore-no-auth-target-0-3985]}
      10:04:09,400 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor] primary=false
      10:04:09,400 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.interceptors.locking.NonTransactionalLockingInterceptor] Are (test-backup-restore-no-auth-target-0-3985) we the lock owners for segment '57'? false
      10:04:09,400 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.interceptors.impl.CallInterceptor] Invoking: PutKeyValueCommand
      10:04:09,400 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.interceptors.impl.EntryWrappingInterceptor] About to commit entry ReadCommittedEntry(3a5d9353){key=ScopedState{scope='cache', name='someCache'}, value=org.infinispan.globalstate.impl.CacheState@675fa237, oldValue=null, isCreated=true, isChanged=true, isRemoved=false, isExpired=false, isCommited=false, skipLookup=false, metadata=EmbeddedExpirableMetadata{version=null, lifespan=-1, maxIdle=-1}, oldMetadata=null, internalMetadata=null}
      10:04:09,400 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.statetransfer.CommitManager] Trying to commit. Key=ScopedState{scope='cache', name='someCache'}. Operation Flag=null, L1 write/invalidation=false
      10:04:09,400 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.statetransfer.CommitManager] Committing key=ScopedState{scope='cache', name='someCache'}. It is a L1 invalidation or a normal put and no tracking is enabled!
      10:04:09,402 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.interceptors.impl.InvocationContextInterceptor] Invoked org.infinispan.PERMISSIONS with command EntrySetCommand{, flags=[CACHE_MODE_LOCAL, SKIP_CACHE_LOAD]} and InvocationContext [org.infinispan.context.impl.NonTxInvocationContext@75909ba8]
      10:04:09,402 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.interceptors.impl.CallInterceptor] Invoking: EntrySetCommand
      10:04:09,406 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.interceptors.impl.InvocationContextInterceptor] Invoked org.infinispan.PERMISSIONS with command EntrySetCommand{, flags=[CACHE_MODE_LOCAL, SKIP_CACHE_LOAD]} and InvocationContext [org.infinispan.context.impl.NonTxInvocationContext@189046c4]
      10:04:09,406 TRACE (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.interceptors.impl.CallInterceptor] Invoking: EntrySetCommand
      10:04:09,465 DEBUG (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.globalstate.impl.GlobalConfigurationManagerImpl] Creating cache someCache from global state
      10:04:09,472 DEBUG (jgroups-5,test-backup-restore-no-auth-target-0-3985) [org.infinispan.globalstate.impl.VolatileLocalConfigurationStorage] Defined cache 'someCache' on 'test-backup-restore-no-auth-target-0-3985' using Configuration{simpleCache=false, clustering=[[type=DISTRIBUTION, mode=true, remote-timeout=15000, invalidation-batch-size=128, bias-acquisition=ON_WRITE, bias-lifespan=300000], hash=[consistent-hash-factory=null, owners=2, segments=256, capacity-factor=1.0, key-partitioner=HashFunctionPartitioner{hashFunction=MurmurHash3, ns=256}], l1=[enabled=false, invalidation-threshold=0, l1-lifespan=600000, l1-cleanup-interval=60000], state-transfer=[enabled=true, timeout=240000, chunk-size=512, await-initial-transfer=true], partition-handling=[when-split=ALLOW_READ_WRITES, merge-policy=NONE]], customInterceptors=[interceptors=[]], encoding=[[media-type=null], key=[media-type=null], value=[media-type=null]], expiration=[lifespan=-1, max-idle=-1, reaperEnabled=true, interval=60000, touch=SYNC], query=[properties={}, default-max-results=100, hit-count-accuracy=10000], indexing=[properties={}, index=null, auto-config=false, key-transformers={}, indexed-entities=[], enabled=false, storage=filesystem, startup-mode=NONE, path=null], reader=[refresh-interval=0], writer=[[thread-pool-size=1, queue-count=1, queue-size=null, commit-interval=null, ram-buffer-size=null, max-buffered-entries=null, low-level-trace=false], index-merge=[max-entries=null, factor=null, min-size=null, max-size=null, max-forced-size=null, calibrate-by-deletes=null]], invocationBatching=[enabled=false], locking=[concurrency-level=32, isolation=REPEATABLE_READ, acquire-timeout=10000, striping=false], memory=[storage=HEAP, max-size=null, max-count=-1, when-full=NONE], modules={}, persistence=[passivation=false, availability-interval=1000, connection-attempts=10, connection-interval=50], stores=[], security=[authorization=[enabled=false, roles=[]]], sites=[[merge-policy=org.infinispan.xsite.spi.DefaultXSiteEntryMergePolicy@262fd64f, max-cleanup-delay=30000, tombstone-map-size=512000], backups=[], backup-for=[remote-cache=null, remote-site=null]], statistics=[statistics=true, statistics-available=true], transaction=[[auto-commit=true, stop-timeout=30000, locking=OPTIMISTIC, transaction-manager-lookup=org.infinispan.transaction.lookup.GenericTransactionManagerLookup@34141dc8, transaction-synchronization-registry-lookup=null, mode=NON_TRANSACTIONAL, synchronization=false, single-phase-auto-commit=false, reaper-interval=30000, complete-timeout=60000, notifications=true], recovery=[enabled=false, recovery-cache=__recoveryInfoCacheName__]], unsafe=[unreliable-return-values=false], template=false}
      

        1. restore.log
          792 kB
        2. restore-fail.log
          733 kB
        3. test-backup-restore-no-auth-target-0.log
          1.56 MB
        4. test-backup-restore-no-auth-target-0-1.log
          1.56 MB

              remerson@redhat.com Ryan Emerson
              remerson@redhat.com Ryan Emerson
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

                Created:
                Updated:
                Resolved: