-
Bug
-
Resolution: Done
-
Blocker
-
7.1.0.DR16
-
Regression
-
-
-
-
-
-
Seen in our clustering failover tests (scenario shutdown) for remote stateful EJBs.
Setup: 4 node cluster, one node at time is shut down gracefully, while 1000 standalone clients keep calling the application.
Description: server perf18 (perf18 is listed just for example - it happens also with any other server) was shut down using native management protocol (ModelControllerClient), equivalent to CLI command ":shutdown(timeout=300). Then it was started again after some time. During server startup (few seconds before JGroups were started and servers got a new cluster view), invocations from clients started to be queued up and no response from a cluster arrived until the perf18 was completely started. Then most of the clients got their response (invocation duration was ~6000, normally invocation takes few few milliseconds). But there are about 20% of clients who did not log anything else after that time and got completely stuck.
NOTE: We set invocation timeout in the client code to 10 seconds:
EJBClient.setInvocationTimeout(session, 10, TimeUnit.SECONDS);
NOTE: There is extra debugging enabled - each EJB client logs these two messages (in the beginning of the invocation and after it received a response):
Invoking SFSB: getSerialAndIncrement() Response acquired, serial: X delay: Y
You can see the ID number of the EJB client in the log message: [Runner - XXX]
Timeline:
2017/04/12 03:17:28:705 EDT [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Bringing back node 0 (perf18)
2017/04/12 03:17:33:618 EDT [DEBUG][Runner - 806] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 32 delay: 2 2017/04/12 03:17:33:625 EDT [DEBUG][Runner - 330] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:627 EDT [DEBUG][Runner - 330] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 35 delay: 2 2017/04/12 03:17:33:631 EDT [DEBUG][Runner - 489] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:635 EDT [DEBUG][Runner - 489] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 34 delay: 4 2017/04/12 03:17:33:646 EDT [DEBUG][Runner - 175] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:648 EDT [DEBUG][Runner - 331] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:651 EDT [DEBUG][Runner - 488] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:651 EDT [DEBUG][Runner - 966] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:653 EDT [DEBUG][Runner - 807] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:654 EDT [DEBUG][Runner - 965] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:654 EDT [DEBUG][Runner - 648] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:662 EDT [DEBUG][Runner - 490] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:689 EDT [DEBUG][Runner - 808] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:690 EDT [DEBUG][Runner - 809] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:690 EDT [DEBUG][Runner - 177] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:690 EDT [DEBUG][Runner - 967] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:33:691 EDT [DEBUG][Runner - 649] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
[JBossINF] [0m[0m03:17:33,605 INFO [org.wildfly.extension.undertow] (MSC service thread 1-4) WFLYUT0018: Host default-host starting [JBossINF] [0m[0m03:17:33,674 INFO [org.wildfly.extension.undertow] (MSC service thread 1-8) WFLYUT0006: Undertow AJP listener ajp listening on 10.16.90.54:8009 [JBossINF] [0m[0m03:17:33,675 INFO [org.wildfly.extension.undertow] (MSC service thread 1-6) WFLYUT0006: Undertow HTTP listener default listening on 10.16.90.54:8080 [JBossINF] [0m[33m03:17:33,684 WARN [org.jboss.as.remoting] (MSC service thread 1-6) ****** All authentication is ANONYMOUS for org.jboss.as.remoting.RemotingHttpUpgradeService [JBossINF] [0m[33m03:17:33,688 WARN [org.jboss.as.remoting] (MSC service thread 1-2) ****** All authentication is ANONYMOUS for org.jboss.as.remoting.RemotingHttpUpgradeService [JBossINF] [0m[0m03:17:33,715 INFO [org.jboss.modcluster] (ServerService Thread Pool -- 68) MODCLUSTER000001: Initializing mod_cluster version 1.3.6.Final-redhat-1 [JBossINF] [0m[0m03:17:34,001 INFO [org.jboss.modcluster] (ServerService Thread Pool -- 68) MODCLUSTER000032: Listening to proxy advertisements on /230.3.90.52:23364
Let's focus on [Runner 9] - one of the EJB clients which gets stuck.
2017/04/12 03:17:37:156 EDT [DEBUG][Runner - 9] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
And finally, after perf18 starts, most of the clients get their responses
2017/04/12 03:17:37:600 EDT [DEBUG][Runner - 329] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:37:610 EDT [DEBUG][Runner - 647] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:37:611 EDT [DEBUG][Runner - 964] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:37:612 EDT [DEBUG][Runner - 174] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:37:618 EDT [DEBUG][Runner - 806] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:37:627 EDT [DEBUG][Runner - 330] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:37:636 EDT [DEBUG][Runner - 489] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement(). 2017/04/12 03:17:40:167 EDT [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Node 0 (perf18) is up. 2017/04/12 03:17:40:167 EDT [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Cluster state: [UP, UP, UP, UP]. 2017/04/12 03:17:40:200 EDT [DEBUG][Runner - 168] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 37 delay: 2732 2017/04/12 03:17:40:201 EDT [DEBUG][Runner - 647] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 34 delay: 2591 2017/04/12 03:17:40:201 EDT [DEBUG][Runner - 481] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 35 delay: 2761 2017/04/12 03:17:40:202 EDT [DEBUG][Runner - 794] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 33 delay: 2872 2017/04/12 03:17:40:202 EDT [DEBUG][Runner - 629] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 34 delay: 2968 2017/04/12 03:17:40:202 EDT [DEBUG][Runner - 315] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 36 delay: 2941 2017/04/12 03:17:40:203 EDT [DEBUG][Runner - 480] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 35 delay: 2778 2017/04/12 03:17:40:203 EDT [DEBUG][Runner - 948] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 32 delay: 2968
In this particular run, there were 234 clients which got stuck. See failover report ( number of "active" sessions):
http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/31/artifact/report/failover.txt
Summary:
There are two issues:
1. Client remote calls get queued up during server startup (they don't during server graceful shutdown)
2. Some clients get stuck completely after server starts up.
NOTE: In order to get load balancing working, we had to set Affinity to Affinity.NONE - See JBEAP-9355