Loading...

Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 5.9.1.Final
Affects Version/s: 5.9.0.Final
Component/s: Tooling
Labels:
None

Steps to Reproduce:
Hide

This could be simulated for WFLY when it should be stopped. WildFly :shutdown command causes suspend being invoked and then JVM is about to be closed.

Periodic recovery is active and running - for processing both phases it needs more than 20 seconds which is the default socket timeout for PeriodicMonitor call

Client requests SCAN, the default timeout for the socket is 20 seconds[5]
java -cp modules/system/layers/base/org/jboss/jts/main/narayana-jts-idlj-5.9.0.Final.jar com.arjuna.ats.arjuna.tools.RecoveryMonitor -host localhost -port 4712

ServiceWorker is started and waits

20 seconds elapsed, client gets "java.net.SocketTimeoutException: Read timed out"

(recovery needs to be still in progress while stopping the server)

server is asked to stop ./bin/jboss-cli.sh --command=':shutdown()' -c which causes suspend is called on the recovery manager and we have the deadlock
Show
This could be simulated for WFLY when it should be stopped. WildFly :shutdown command causes suspend being invoked and then JVM is about to be closed. Periodic recovery is active and running - for processing both phases it needs more than 20 seconds which is the default socket timeout for PeriodicMonitor call Client requests SCAN, the default timeout for the socket is 20 seconds [5] java -cp modules/system/layers/base/org/jboss/jts/main/narayana-jts-idlj-5.9.0.Final.jar com.arjuna.ats.arjuna.tools.RecoveryMonitor -host localhost -port 4712 ServiceWorker is started and waits 20 seconds elapsed, client gets "java.net.SocketTimeoutException: Read timed out" (recovery needs to be still in progress while stopping the server) server is asked to stop ./bin/jboss-cli.sh --command=':shutdown()' -c which causes suspend is called on the recovery manager and we have the deadlock
Git Pull Request:
https://github.com/jbosstm/narayana/pull/1353

If PeriodicMonitor has active unfinished request to start recovery scan

INFO  [com.arjuna.ats.arjuna] (Server.Connection:127.0.0.1:36326) ARJUNA012340: RecoveryManager scan scheduled to begin.

and meanwhile the RecoveryManager is suspended then the WorkerService is waiting in deadlock forever and JVM can't be cleanly stopped.

The whole stopping EAP server stacktrace could be seen at https://issues.jboss.org/browse/CLOUD-2767 the related Narayana related threads are[10].

It seems to me as the recovery was put to suspended[1] the processing thread waits for the state change[2]. The state change to set to terminated could be done by periodic recovery shutdown[3] but it's stuck on waiting for listners to be stopped[4]. Stopping the listeners could be done by the thread waiting for the state change[5].

[1] https://github.com/wildfly/wildfly/blob/master/transactions/src/main/java/org/jboss/as/txn/suspend/RecoverySuspendController.java#L41
[2] https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java#L348
[3] https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java#L177
[4] https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java#L169
[5] https://github.com/jbosstm/narayana/blob/master/ArjunaCore/arjuna/classes/com/arjuna/ats/internal/arjuna/recovery/PeriodicRecovery.java#L387

[10]

"Transaction Reaper Worker 0" #121 daemon prio=5 os_prio=0 tid=0x000000000197a000 nid=0x2b6 in Object.wait() [0x00007f3661d2b000]
   java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 - waiting on <0x00000000b289c310> (a java.util.LinkedList)
 at java.lang.Object.wait(Object.java:502)
 at com.arjuna.ats.arjuna.coordinator.TransactionReaper.waitForCancellations(TransactionReaper.java:328)
 - locked <0x00000000b289c310> (a java.util.LinkedList)
 at com.arjuna.ats.internal.arjuna.coordinator.ReaperWorkerThread.run(ReaperWorkerThread.java:65)
   Locked ownable synchronizers:
 - None

"Transaction Reaper" #120 daemon prio=5 os_prio=0 tid=0x0000000001b4b800 nid=0x2b5 in Object.wait() [0x00007f3661e2c000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 - waiting on <0x00000000b289c330> (a com.arjuna.ats.arjuna.coordinator.TransactionReaper)
 at com.arjuna.ats.internal.arjuna.coordinator.ReaperThread.run(ReaperThread.java:90)
 - locked <0x00000000b289c330> (a com.arjuna.ats.arjuna.coordinator.TransactionReaper)
   Locked ownable synchronizers:
 - None


"Server.Connection:10.130.0.162:46200" #335 daemon prio=5 os_prio=0 tid=0x0000000003853800 nid=0x449 in Object.wait() [0x00007f365ec2c000]
   java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 at java.lang.Object.wait(Object.java:502)
 at com.arjuna.ats.internal.arjuna.recovery.WorkerService.doWork(WorkerService.java:95)
 - locked <0x00000000b2916608> (a com.arjuna.ats.internal.arjuna.recovery.WorkerService)
 at com.arjuna.ats.internal.arjuna.recovery.Connection.run(Connection.java:88)
   Locked ownable synchronizers:
 - None

"Server.Connection:10.130.0.162:46090" #225 daemon prio=5 os_prio=0 tid=0x0000000003a4c800 nid=0x3ca in Object.wait() [0x00007f365c2ef000]
   java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 at java.lang.Object.wait(Object.java:502)
 at com.arjuna.ats.internal.arjuna.recovery.WorkerService.doWork(WorkerService.java:95)
 - locked <0x00000000b2916608> (a com.arjuna.ats.internal.arjuna.recovery.WorkerService)
 at com.arjuna.ats.internal.arjuna.recovery.Connection.run(Connection.java:88)
   Locked ownable synchronizers:
 - None

"MSC service thread 1-5" #19 prio=5 os_prio=0 tid=0x0000000001903000 nid=0x251 in Object.wait() [0x00007f366aebe000]
   java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 at java.lang.Object.wait(Object.java:502)
 at com.arjuna.ats.internal.arjuna.recovery.Listener.stopListener(Listener.java:222)
 - locked <0x00000000b29161d0> (a com.arjuna.ats.internal.arjuna.recovery.Listener)
 at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.shutdown(PeriodicRecovery.java:169)
 at com.arjuna.ats.internal.arjuna.recovery.RecoveryManagerImple.stop(RecoveryManagerImple.java:179)
 at com.arjuna.ats.arjuna.recovery.RecoveryManager.terminate(RecoveryManager.java:210)
 - locked <0x00000000b0badbb8> (a com.arjuna.ats.arjuna.recovery.RecoveryManager)
 at com.arjuna.ats.arjuna.recovery.RecoveryManager.terminate(RecoveryManager.java:194)
 at com.arjuna.ats.jbossatx.jta.RecoveryManagerService.stop(RecoveryManagerService.java:73)
 at org.jboss.as.txn.service.ArjunaRecoveryManagerService.stop(ArjunaRecoveryManagerService.java:167)
 - locked <0x00000000b313ddb0> (a org.jboss.as.txn.service.ArjunaRecoveryManagerService)
 at org.jboss.msc.service.ServiceControllerImpl$StopTask.stopService(ServiceControllerImpl.java:2150)
 at org.jboss.msc.service.ServiceControllerImpl$StopTask.run(ServiceControllerImpl.java:2101)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
 at java.lang.Thread.run(Thread.java:748)
   Locked ownable synchronizers:
 - <0x00000000b351b500> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"Periodic Recovery" #116 prio=5 os_prio=0 tid=0x0000000002357000 nid=0x2b4 in Object.wait() [0x00007f3661f2d000]
   java.lang.Thread.State: WAITING (on object monitor)
 at java.lang.Object.wait(Native Method)
 at java.lang.Object.wait(Object.java:502)
 at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.doSuspendedWait(PeriodicRecovery.java:709)
 at com.arjuna.ats.internal.arjuna.recovery.PeriodicRecovery.run(PeriodicRecovery.java:346)
 - locked <0x00000000b289c898> (a java.lang.Object)
   Locked ownable synchronizers:
 - None

NOTE: there is interesting point there are two WorkerService instances in the stack trace. It's causes two sequential requests hits the RecoveryManager while the periodic recovery was in progress before the server was shutdown. The code invoking the two calls and then immediately running the shutdown was https://github.com/jboss-openshift/cct_module/blob/7f891cca882f86f983b298b288fe0b5c12bc51bf/os-eap-migration/added/launch/openshift-migrate-common.sh#L50

causes

CLOUD-2767 [EAP][XA] EAP in migration pod hangs during shutdown on GlusterFS

Verified

is cloned by

JBEAP-16195 (7.2.z) JBTM-3047 - Suspending recovery manager causes deadlock when acive RecoveryMonitor scan request exists

Closed

Details

Description

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates