Infinispan / ISPN-12423

Infinispan thread freezing (STUCK) - deadlock occurred

This issue belongs to an archived project. You can view it, but you can't modify it.

    • Type: Bug
    • Resolution: Obsolete
    • Priority: Major
    • Fix Version/s: None
    • Affects Version/s: 11.0.4.Final
    • Component/s: Core
    • Labels: None
    • Steps to Reproduce: Just generate a huge read/write load.

      Under heavy load, some threads hang with the following stack:

      java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x00000005d2e654a8> (a java.util.concurrent.locks.StampedLock)
        at java.util.concurrent.locks.StampedLock.acquireRead(StampedLock.java:1215)
        at java.util.concurrent.locks.StampedLock.readLock(StampedLock.java:428)
        at org.infinispan.container.offheap.OffHeapConcurrentMap.peekOrGet(OffHeapConcurrentMap.java:615)
        at org.infinispan.container.offheap.OffHeapConcurrentMap.peek(OffHeapConcurrentMap.java:682)
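
      To illustrate what the parked threads are waiting for, here is a minimal standalone sketch (illustrative only, not Infinispan code): a StampedLock read acquisition parks the caller for as long as a write stamp is held and never released.

      import java.util.concurrent.locks.StampedLock;

      public class StampedLockHang {
          public static void main(String[] args) throws InterruptedException {
              StampedLock lock = new StampedLock();
              long writeStamp = lock.writeLock(); // held and never unlocked

              Thread reader = new Thread(() -> {
                  long readStamp = lock.readLock(); // parks here, as in the dump above
                  try {
                      System.out.println("never reached while the write stamp is held");
                  } finally {
                      lock.unlockRead(readStamp);
                  }
              }, "reader");
              reader.start();

              Thread.sleep(1000);
              System.out.println(reader.getState()); // WAITING (parked in acquireRead)
              lock.unlockWrite(writeStamp);          // releasing it lets the reader finish
              reader.join();
          }
      }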

      I have attached the Infinispan configuration and three thread dumps:

      • off heap storage enabled (Thread dump.txt)
      • heap storage enabled (Thread dump - heap storage.txt)
      • off heap storage enabled and replicated cache mode="SYNC" (thread dump sync.txt)

      Under high load, Infinispan freezes in 100% of cases.

      Attachments:

        1. cache-configuration-jdbc.xml (4 kB)
        2. cache-configuration-jdbc.xml (4 kB)
        3. Thread dump.txt (2.96 MB)
        4. Thread dump - heap storage.txt (1.12 MB)
        5. Thread dump sync.txt (1.22 MB)

            [ISPN-12423] Infinispan thread freezing (STUCK) - deadlock occurred

            Tristan Tarrant added a comment -

            Infinispan issue tracking has been migrated to GitHub issues: https://github.com/infinispan/infinispan/issues
            If you still want this issue to be worked on, create a new issue on GitHub and link this issue.

            Dan Berindei (Inactive) added a comment -

            c0ff75 If you are unable to reproduce the problem with trace logging enabled, or if the logs run to tens of GBs, it may be easier to find the root of the problem from a heap dump.

            We know something is starting a huge chain of CompletableFutures that is possibly leading to a stack overflow (e.g. thread "non-blocking-thread--p2-t14" in the heap storage thread dump). The most interesting stack frames, the ones at the bottom, are missing from the thread dumps because the stack is so deep, but it should be possible to see the full stack in a heap dump.
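
            To make this failure mode concrete, here is a small standalone sketch (illustrative only, not Infinispan code) of how a chain of CompletableFuture stages that complete synchronously can grow one thread's stack until it overflows:

            import java.util.concurrent.CompletableFuture;

            // Illustrative only: when each stage's function synchronously starts the
            // next stage, the whole chain runs on a single thread's stack, so a long
            // enough chain overflows it.
            public class DeepCompletionChain {

                static CompletableFuture<Integer> step(int remaining) {
                    if (remaining == 0) {
                        return CompletableFuture.completedFuture(0);
                    }
                    // thenCompose on an already-completed future invokes the function
                    // immediately on the calling thread, so each stage adds stack frames.
                    return CompletableFuture.completedFuture(remaining)
                            .thenCompose(r -> step(r - 1).thenApply(total -> total + 1));
                }

                public static void main(String[] args) {
                    System.out.println(step(1_000).join()); // fine: prints 1000
                    try {
                        System.out.println(step(1_000_000).join());
                    } catch (Throwable t) {
                        // Typically a CompletionException caused by StackOverflowError.
                        System.out.println("deep chain failed: " + t);
                    }
                }
            }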


            Will Burns added a comment -

            A stack trace taken while it is hung, together with the log, would give the best picture. Just make sure your log format includes the thread name.


            Will Burns added a comment -

            Just add

            <Logger name="org.infinispan" level="TRACE"/>
            

            to your loggers.
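
            The reporter uses Log4j2 directly without SLF4J (see the question below), so the level can also be raised programmatically; here is a minimal sketch using the Log4j2 core Configurator API, assuming log4j-core (not just log4j-api) is on the classpath. The %t conversion in a Log4j2 PatternLayout is what puts the thread name in each log line, as suggested above.

            import org.apache.logging.log4j.Level;
            import org.apache.logging.log4j.core.config.Configurator;

            public class EnableInfinispanTrace {
                public static void main(String[] args) {
                    // Equivalent in effect to <Logger name="org.infinispan" level="TRACE"/>:
                    // raises the org.infinispan logger to TRACE at runtime.
                    Configurator.setLevel("org.infinispan", Level.TRACE);
                }
            }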


            Evgenii Balakhonov (Inactive) added a comment -

            Please tell me how to correctly enable the TRACE level for Infinispan. I use Log4j2 in my app, without SLF4J.

            Both caches are the same, but in testing I use only the first cache, GS46CACHE_GS46NEW.

            Will Burns added a comment -

            Also, just to confirm: there are two caches in the XML file you posted. Which one is having the issue?

            Will Burns added a comment -

            Is it possible you could run with org.infinispan set to the TRACE level? I know it will be a really big file, but maybe just keep the last part of it, from right before it started to hang?

            Evgenii Balakhonov (Inactive) added a comment -

            Will Burns,

            I have now tried switching to the old binary memory storage mode:

            <memory max-size="10GB" when-full="REMOVE" storage="BINARY"/>

            In this mode Infinispan works reliably :-\

            But if I switch back to storage="OFF_HEAP", then after 30-40 minutes of high load I get:

            "[ACTIVE] ExecuteThread: '3' for queue: 'weblogic.kernel.Default (self-tuning)'" #31 daemon prio=5 os_prio=0 tid=0x00007f5d8c9a4800 nid=0x2005 waiting on condition [0x00007f5d7e184000]

            java.lang.Thread.State: WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for <0x00000005d72b1140> (a java.util.concurrent.locks.StampedLock)
              at java.util.concurrent.locks.StampedLock.acquireRead(StampedLock.java:1215)
              at java.util.concurrent.locks.StampedLock.readLock(StampedLock.java:428)
              at org.infinispan.container.offheap.OffHeapConcurrentMap.peekOrGet(OffHeapConcurrentMap.java:615)
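
            For reference, here is a programmatic sketch of the two memory configurations being compared, written against the Infinispan 11 ConfigurationBuilder API (the XML above is what was actually tested; treat the builder calls as an approximation):

            import org.infinispan.configuration.cache.Configuration;
            import org.infinispan.configuration.cache.ConfigurationBuilder;
            import org.infinispan.configuration.cache.StorageType;
            import org.infinispan.eviction.EvictionStrategy;

            public class MemoryModes {
                public static void main(String[] args) {
                    // Equivalent of <memory max-size="10GB" when-full="REMOVE" storage="BINARY"/>,
                    // which the reporter says works reliably.
                    Configuration binary = new ConfigurationBuilder()
                            .memory()
                                .storage(StorageType.BINARY)
                                .maxSize("10GB")
                                .whenFull(EvictionStrategy.REMOVE)
                            .build();

                    // The same settings with storage="OFF_HEAP", which hangs after
                    // 30-40 minutes of high load in the reporter's environment.
                    Configuration offHeap = new ConfigurationBuilder()
                            .memory()
                                .storage(StorageType.OFF_HEAP)
                                .maxSize("10GB")
                                .whenFull(EvictionStrategy.REMOVE)
                            .build();

                    System.out.println(binary);
                    System.out.println(offHeap);
                }
            }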


            Evgenii Balakhonov (Inactive) added a comment -

            Hello, Will Burns!

            I have no listeners in my code. I use Infinispan very simply: I just put and get objects.
            My app and Infinispan run in the Oracle cloud on the Oracle JDK.
            I searched the logs for a StackOverflowError, but there is none.

            Best regards,
            Eugeniy

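
            For reference, a minimal sketch of the put/get usage pattern described above (the configuration file and cache name are taken from this issue; everything else is assumed):

            import org.infinispan.Cache;
            import org.infinispan.manager.DefaultCacheManager;

            public class SimpleUsage {
                public static void main(String[] args) throws Exception {
                    // Plain put/get against a configured cache, with no listeners registered.
                    try (DefaultCacheManager manager =
                             new DefaultCacheManager("cache-configuration-jdbc.xml")) {
                        Cache<String, byte[]> cache = manager.getCache("GS46CACHE_GS46NEW");
                        cache.put("key-1", new byte[1024]); // write path
                        byte[] value = cache.get("key-1");  // read path
                        System.out.println("read " + value.length + " bytes");
                    }
                }
            }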

            Will Burns added a comment -

            Another question: do you use any listeners in your code? More specifically, an eviction listener?

            Will Burns added a comment -

            c0ff75 From the stack traces, at this point it appears this was caused by a StackOverflowError in conjunction with https://bugs.openjdk.java.net/browse/JDK-8028686. Do you have the output of the server, or do you see any StackOverflowError in your logs?

            Evgenii Balakhonov (Inactive) added a comment - edited

            Hello, Will Burns!

            I can reproduce this problem at any time. I can show it to you if you want, for example via Skype or TeamViewer.

            Moreover, if I switch off OFF_HEAP and SYNC (see the last attached cache-configuration-jdbc.xml), I get a different deadlock problem:

            "[STUCK] ExecuteThread: '220' for queue: 'weblogic.kernel.Default (self-tuning)'" #812 daemon prio=1 os_prio=0 tid=0x00007ff29c136000 nid=0x4fb8 waiting on condition [0x00007ff1e281f000]

            java.lang.Thread.State: TIMED_WAITING (parking)
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for <0x00000006ec18b8a8> (a java.util.concurrent.CompletableFuture$Signaller)
              at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
              at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1709)
              at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
              at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1788)
              at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1928)
              at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:125)
              at org.infinispan.interceptors.impl.SimpleAsyncInvocationStage.get(SimpleAsyncInvocationStage.java:36)
              at org.infinispan.interceptors.impl.AsyncInterceptorChainImpl.invoke(AsyncInterceptorChainImpl.java:246)

            See the attached thread dump "Thread dump - heap storage.txt".

            As a result, my WebLogic server restarts automatically because the number of STUCK threads exceeds the critical threshold.
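
            A small standalone sketch (illustrative, not Infinispan code) of the blocking pattern in this trace: a synchronous caller parks in CompletableFuture.get(timeout) until the asynchronous invocation completes, and stays STUCK if it never does.

            import java.util.concurrent.CompletableFuture;
            import java.util.concurrent.TimeUnit;
            import java.util.concurrent.TimeoutException;

            public class SyncOverAsync {
                public static void main(String[] args) throws Exception {
                    // Stands in for the invocation stage returned by the interceptor chain.
                    CompletableFuture<String> invocation = new CompletableFuture<>();
                    try {
                        // Comparable to CompletableFutures.await(...) in the dump: the caller
                        // parks on a CompletableFuture$Signaller in TIMED_WAITING.
                        String result = invocation.get(5, TimeUnit.SECONDS);
                        System.out.println(result);
                    } catch (TimeoutException e) {
                        System.out.println("invocation never completed within the timeout");
                    }
                }
            }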


            Will Burns added a comment -

            c0ff75 Are you certain it doesn't make any progress?


            Will Burns added a comment -

            From the off-heap stack traces, it appears that this boils down to some threads blocking while waiting on the lruLock, which is present in the data container when eviction is enabled.

            However, I couldn't find any thread that would actually be holding the lock, and from a glance at the 11.0.4 source, all references to the lock use the same try/finally unlock idiom, which means it should always be released.

            Looking at the on-heap dump, it seems to be a similar issue: there is no thread holding the lock, just a bunch waiting for it. Note that our on-heap implementation uses Caffeine, and I would be surprised if they had an issue with their locks as well.

            I can try to take another look tomorrow with fresh eyes, though.
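
            The try/finally idiom described here looks roughly like the following sketch (the names and lock type are illustrative, not the actual Infinispan source):

            import java.util.concurrent.locks.StampedLock;

            public class LruLockIdiom {
                private final StampedLock lruLock = new StampedLock();
                private long accesses;

                void recordAccess() {
                    long stamp = lruLock.writeLock();
                    try {
                        accesses++; // stand-in for the LRU bookkeeping done under the lock
                    } finally {
                        // Always runs, so the lock cannot leak on any normal code path.
                        lruLock.unlockWrite(stamp);
                    }
                    // The idiom only fails if a thread dies inside the lock's own internals,
                    // e.g. from a StackOverflowError, as discussed above (JDK-8028686).
                }
            }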


              Assignee: Unassigned
              Reporter: Evgenii Balakhonov (Inactive) (c0ff75)
              Archiver: Amol Dongare (rhn-support-adongare)

                Created:
                Updated:
                Resolved:
                Archived: