Loading...

XML

Word

Printable

Type: Bug
Resolution: Done
Priority: Blocker
Fix Version/s: 7.1.0.DR19
Affects Version/s: 7.1.0.DR16
Component/s: Clustering, EJB
Labels:
- eap71_beta

Affects Testing:

Regression
CDW blocker:
CDW devel_ack:
CDW docs_ack:
CDW pm_ack:
CDW qa_ack:
CDW release:
Target Release:

7.1.0.GA

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Seen in our clustering failover tests (scenario shutdown) for remote stateful EJBs.
Setup: 4 node cluster, one node at time is shut down gracefully, while 1000 standalone clients keep calling the application.

Description: server perf18 (perf18 is listed just for example - it happens also with any other server) was shut down using native management protocol (ModelControllerClient), equivalent to CLI command ":shutdown(timeout=300). Then it was started again after some time. During server startup (few seconds before JGroups were started and servers got a new cluster view), invocations from clients started to be queued up and no response from a cluster arrived until the perf18 was completely started. Then most of the clients got their response (invocation duration was ~6000, normally invocation takes few few milliseconds). But there are about 20% of clients who did not log anything else after that time and got completely stuck.

NOTE: We set invocation timeout in the client code to 10 seconds:

EJBClient.setInvocationTimeout(session, 10, TimeUnit.SECONDS);

Client log:
http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/31/console-perf17/

NOTE: There is extra debugging enabled - each EJB client logs these two messages (in the beginning of the invocation and after it received a response):

Invoking SFSB: getSerialAndIncrement()
Response acquired, serial: X delay: Y

You can see the ID number of the EJB client in the log message: [Runner - XXX]

Timeline:

perf18 is started again

2017/04/12 03:17:28:705 EDT [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Bringing back node 0 (perf18)

2017/04/12 03:17:33:646 all client requests start to queue up

2017/04/12 03:17:33:618 EDT [DEBUG][Runner - 806] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 32 delay: 2
2017/04/12 03:17:33:625 EDT [DEBUG][Runner - 330] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:627 EDT [DEBUG][Runner - 330] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 35 delay: 2
2017/04/12 03:17:33:631 EDT [DEBUG][Runner - 489] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:635 EDT [DEBUG][Runner - 489] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 34 delay: 4
2017/04/12 03:17:33:646 EDT [DEBUG][Runner - 175] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:648 EDT [DEBUG][Runner - 331] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:651 EDT [DEBUG][Runner - 488] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:651 EDT [DEBUG][Runner - 966] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:653 EDT [DEBUG][Runner - 807] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:654 EDT [DEBUG][Runner - 965] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:654 EDT [DEBUG][Runner - 648] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:662 EDT [DEBUG][Runner - 490] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:689 EDT [DEBUG][Runner - 808] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:690 EDT [DEBUG][Runner - 809] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:690 EDT [DEBUG][Runner - 177] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:690 EDT [DEBUG][Runner - 967] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:33:691 EDT [DEBUG][Runner - 649] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().

at the same time, perf18 is far from starting JGroups channel...

[JBossINF] [0m[0m03:17:33,605 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-4) WFLYUT0018: Host default-host starting
[JBossINF] [0m[0m03:17:33,674 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-8) WFLYUT0006: Undertow AJP listener ajp listening on 10.16.90.54:8009
[JBossINF] [0m[0m03:17:33,675 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-6) WFLYUT0006: Undertow HTTP listener default listening on 10.16.90.54:8080
[JBossINF] [0m[33m03:17:33,684 WARN  [org.jboss.as.remoting] (MSC service thread 1-6) ****** All authentication is ANONYMOUS for org.jboss.as.remoting.RemotingHttpUpgradeService
[JBossINF] [0m[33m03:17:33,688 WARN  [org.jboss.as.remoting] (MSC service thread 1-2) ****** All authentication is ANONYMOUS for org.jboss.as.remoting.RemotingHttpUpgradeService
[JBossINF] [0m[0m03:17:33,715 INFO  [org.jboss.modcluster] (ServerService Thread Pool -- 68) MODCLUSTER000001: Initializing mod_cluster version 1.3.6.Final-redhat-1
[JBossINF] [0m[0m03:17:34,001 INFO  [org.jboss.modcluster] (ServerService Thread Pool -- 68) MODCLUSTER000032: Listening to proxy advertisements on /230.3.90.52:23364

perf18 log: http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/31/console-perf18/

Let's focus on [Runner 9] - one of the EJB clients which gets stuck.

The last log message from [Runner 9]

2017/04/12 03:17:37:156 EDT [DEBUG][Runner - 9] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().

And finally, after perf18 starts, most of the clients get their responses

perf18 started

2017/04/12 03:17:37:600 EDT [DEBUG][Runner - 329] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:37:610 EDT [DEBUG][Runner - 647] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:37:611 EDT [DEBUG][Runner - 964] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:37:612 EDT [DEBUG][Runner - 174] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:37:618 EDT [DEBUG][Runner - 806] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:37:627 EDT [DEBUG][Runner - 330] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:37:636 EDT [DEBUG][Runner - 489] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
2017/04/12 03:17:40:167 EDT [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Node 0 (perf18) is up.
2017/04/12 03:17:40:167 EDT [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Cluster state: [UP, UP, UP, UP].
2017/04/12 03:17:40:200 EDT [DEBUG][Runner - 168] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 37 delay: 2732
2017/04/12 03:17:40:201 EDT [DEBUG][Runner - 647] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 34 delay: 2591
2017/04/12 03:17:40:201 EDT [DEBUG][Runner - 481] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 35 delay: 2761
2017/04/12 03:17:40:202 EDT [DEBUG][Runner - 794] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 33 delay: 2872
2017/04/12 03:17:40:202 EDT [DEBUG][Runner - 629] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 34 delay: 2968
2017/04/12 03:17:40:202 EDT [DEBUG][Runner - 315] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 36 delay: 2941
2017/04/12 03:17:40:203 EDT [DEBUG][Runner - 480] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 35 delay: 2778
2017/04/12 03:17:40:203 EDT [DEBUG][Runner - 948] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 32 delay: 2968

In this particular run, there were 234 clients which got stuck. See failover report ( number of "active" sessions):
http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/31/artifact/report/failover.txt

Summary:
There are two issues:
1. Client remote calls get queued up during server startup (they don't during server graceful shutdown)
2. Some clients get stuck completely after server starts up.

NOTE: In order to get load balancing working, we had to set Affinity to Affinity.NONE - See ~~JBEAP-9355~~

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

invocations-blocked-until-restart.txt
224 kB
2017/05/03 9:00 AM
thread_dump.zip
212 kB
2017/04/26 10:19 AM

Assignee:: Richard Achmatowicz

Reporter:: Michal Vinkler

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Created:: 2017/04/12 4:40 AM

Updated:: 2024/09/02 3:53 PM

Resolved:: 2017/05/24 3:48 PM

Details

Description

Attachments

Attachments

Easy Agile Planning Poker

Activity

People

Dates