Uploaded image for project: 'JBoss Transaction Manager'
  1. JBoss Transaction Manager
  2. JBTM-3047

Suspending recovery manager causes deadlock when acive RecoveryMonitor scan request exists

    XMLWordPrintable

Details

    • Bug
    • Resolution: Done
    • Major
    • 5.9.1.Final
    • 5.9.0.Final
    • Tooling
    • None
    • Hide

      This could be simulated for WFLY when it should be stopped. WildFly :shutdown command causes suspend being invoked and then JVM is about to be closed.

      • Periodic recovery is active and running - for processing both phases it needs more than 20 seconds which is the default socket timeout for PeriodicMonitor call
      • Client requests SCAN, the default timeout for the socket is 20 seconds[5]
        java -cp modules/system/layers/base/org/jboss/jts/main/narayana-jts-idlj-5.9.0.Final.jar com.arjuna.ats.arjuna.tools.RecoveryMonitor -host localhost -port 4712
      • ServiceWorker is started and waits
      • 20 seconds elapsed, client gets "java.net.SocketTimeoutException: Read timed out"
      • (recovery needs to be still in progress while stopping the server)
      • server is asked to stop ./bin/jboss-cli.sh --command=':shutdown()' -c which causes suspend is called on the recovery manager and we have the deadlock
      Show
      This could be simulated for WFLY when it should be stopped. WildFly :shutdown command causes suspend being invoked and then JVM is about to be closed. Periodic recovery is active and running - for processing both phases it needs more than 20 seconds which is the default socket timeout for PeriodicMonitor call Client requests SCAN, the default timeout for the socket is 20 seconds [5] java -cp modules/system/layers/base/org/jboss/jts/main/narayana-jts-idlj-5.9.0.Final.jar com.arjuna.ats.arjuna.tools.RecoveryMonitor -host localhost -port 4712 ServiceWorker is started and waits 20 seconds elapsed, client gets "java.net.SocketTimeoutException: Read timed out" (recovery needs to be still in progress while stopping the server) server is asked to stop ./bin/jboss-cli.sh --command=':shutdown()' -c which causes suspend is called on the recovery manager and we have the deadlock

    Description

      If PeriodicMonitor has active unfinished request to start recovery scan

      INFO  [com.arjuna.ats.arjuna] (Server.Connection:127.0.0.1:36326) ARJUNA012340: RecoveryManager scan scheduled to begin.
      

      and meanwhile the RecoveryManager is suspended then the WorkerService is waiting in deadlock forever and JVM can't be cleanly stopped.

      The whole stopping EAP server stacktrace could be seen at https://issues.jboss.org/browse/CLOUD-2767 the related Narayana related threads are[10].

      It seems to me as the recovery was put to suspended[1] the processing thread waits for the state change[2]. The state change to set to terminated could be done by periodic recovery shutdown[3] but it's stuck on waiting for listners to be stopped[4]. Stopping the listeners could be done by the thread waiting for the state change[5].

      [1] https://github.com/wildfly/wildfly/blob/master/transactions/src/main/java/org/jboss/as/txn/suspend/RecoverySuspendController.java#L41
      [2] https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java#L348
      [3] https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java#L177
      [4] https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java#L169
      [5] https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java#L387

      [10]

      "Transaction Reaper Worker 0" #121 daemon prio=5 os_prio=0 tid=0x000000000197a000 nid=0x2b6 in Object.wait() [0x00007f3661d2b000]
         java.lang.Thread.State: WAITING (on object monitor)
       at java.lang.Object.wait(Native Method)
       - waiting on <0x00000000b289c310> (a java.util.LinkedList)
       at java.lang.Object.wait(Object.java:502)
       at com.arjuna.ats.arjuna.coordinator.TransactionReaper.waitForCancellations(TransactionReaper.java:328)
       - locked <0x00000000b289c310> (a java.util.LinkedList)
       at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:65)
         Locked ownable synchronizers:
       - None
      
      "Transaction Reaper" #120 daemon prio=5 os_prio=0 tid=0x0000000001b4b800 nid=0x2b5 in Object.wait() [0x00007f3661e2c000]
         java.lang.Thread.State: TIMED_WAITING (on object monitor)
       at java.lang.Object.wait(Native Method)
       - waiting on <0x00000000b289c330> (a com.arjuna.ats.arjuna.coordinator.TransactionReaper)
       at com.arjuna.ats.internal.arjuna.coordinator.ReaperThread.run(ReaperThread.java:90)
       - locked <0x00000000b289c330> (a com.arjuna.ats.arjuna.coordinator.TransactionReaper)
         Locked ownable synchronizers:
       - None
      
      
      "Server.Connection:10.130.0.162:46200" #335 daemon prio=5 os_prio=0 tid=0x0000000003853800 nid=0x449 in Object.wait() [0x00007f365ec2c000]
         java.lang.Thread.State: WAITING (on object monitor)
       at java.lang.Object.wait(Native Method)
       at java.lang.Object.wait(Object.java:502)
       at com.arjuna.ats.internal.arjuna.recovery.WorkerService.doWork(WorkerService.java:95)
       - locked <0x00000000b2916608> (a com.arjuna.ats.internal.arjuna.recovery.WorkerService)
       at com.arjuna.ats.internal.arjuna.recovery.Connection.run(Connection.java:88)
         Locked ownable synchronizers:
       - None
      
      "Server.Connection:10.130.0.162:46090" #225 daemon prio=5 os_prio=0 tid=0x0000000003a4c800 nid=0x3ca in Object.wait() [0x00007f365c2ef000]
         java.lang.Thread.State: WAITING (on object monitor)
       at java.lang.Object.wait(Native Method)
       at java.lang.Object.wait(Object.java:502)
       at com.arjuna.ats.internal.arjuna.recovery.WorkerService.doWork(WorkerService.java:95)
       - locked <0x00000000b2916608> (a com.arjuna.ats.internal.arjuna.recovery.WorkerService)
       at com.arjuna.ats.internal.arjuna.recovery.Connection.run(Connection.java:88)
         Locked ownable synchronizers:
       - None
      
      "MSC service thread 1-5" #19 prio=5 os_prio=0 tid=0x0000000001903000 nid=0x251 in Object.wait() [0x00007f366aebe000]
         java.lang.Thread.State: WAITING (on object monitor)
       at java.lang.Object.wait(Native Method)
       at java.lang.Object.wait(Object.java:502)
       at com.arjuna.ats.internal.arjuna.recovery.Listener.stopListener(Listener.java:222)
       - locked <0x00000000b29161d0> (a com.arjuna.ats.internal.arjuna.recovery.Listener)
       at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.shutdown(PeriodicRecovery.java:169)
       at com.arjuna.ats.internal.arjuna.recovery.RecoveryManagerImple.stop(RecoveryManagerImple.java:179)
       at com.arjuna.ats.arjuna.recovery.RecoveryManager.terminate(RecoveryManager.java:210)
       - locked <0x00000000b0badbb8> (a com.arjuna.ats.arjuna.recovery.RecoveryManager)
       at com.arjuna.ats.arjuna.recovery.RecoveryManager.terminate(RecoveryManager.java:194)
       at com.arjuna.ats.jbossatx.jta.RecoveryManagerService.stop(RecoveryManagerService.java:73)
       at org.jboss.as.txn.service.ArjunaRecoveryManagerService.stop(ArjunaRecoveryManagerService.java:167)
       - locked <0x00000000b313ddb0> (a org.jboss.as.txn.service.ArjunaRecoveryManagerService)
       at org.jboss.msc.service.ServiceControllerImpl$StopTask.stopService(ServiceControllerImpl.java:2150)
       at org.jboss.msc.service.ServiceControllerImpl$StopTask.run(ServiceControllerImpl.java:2101)
       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       at java.lang.Thread.run(Thread.java:748)
         Locked ownable synchronizers:
       - <0x00000000b351b500> (a java.util.concurrent.ThreadPoolExecutor$Worker)
      
      "Periodic Recovery" #116 prio=5 os_prio=0 tid=0x0000000002357000 nid=0x2b4 in Object.wait() [0x00007f3661f2d000]
         java.lang.Thread.State: WAITING (on object monitor)
       at java.lang.Object.wait(Native Method)
       at java.lang.Object.wait(Object.java:502)
       at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doSuspendedWait(PeriodicRecovery.java:709)
       at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:346)
       - locked <0x00000000b289c898> (a java.lang.Object)
         Locked ownable synchronizers:
       - None
      

      NOTE: there is interesting point there are two WorkerService instances in the stack trace. It's causes two sequential requests hits the RecoveryManager while the periodic recovery was in progress before the server was shutdown. The code invoking the two calls and then immediately running the shutdown was https://github.com/jboss-openshift/cct_module/blob/7f891cca882f86f983b298b288fe0b5c12bc51bf/os-eap-migration/added/launch/openshift-migrate-common.sh#L50

      Attachments

        Issue Links

          Activity

            People

              ochaloup@redhat.com Ondrej Chaloupka (Inactive)
              ochaloup@redhat.com Ondrej Chaloupka (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: