Uploaded image for project: 'JBoss Enterprise Application Platform'
  1. JBoss Enterprise Application Platform
  2. JBEAP-10307

EJB client freezes intermittently when invoking a remote bean during cluster topology change (server startup)

XMLWordPrintable

    • Regression

      Seen in our clustering failover tests (scenario shutdown) for remote stateful EJBs.
      Setup: 4 node cluster, one node at time is shut down gracefully, while 1000 standalone clients keep calling the application.

      Description: server perf18 (perf18 is listed just for example - it happens also with any other server) was shut down using native management protocol (ModelControllerClient), equivalent to CLI command ":shutdown(timeout=300). Then it was started again after some time. During server startup (few seconds before JGroups were started and servers got a new cluster view), invocations from clients started to be queued up and no response from a cluster arrived until the perf18 was completely started. Then most of the clients got their response (invocation duration was ~6000, normally invocation takes few few milliseconds). But there are about 20% of clients who did not log anything else after that time and got completely stuck.

      NOTE: We set invocation timeout in the client code to 10 seconds:

      EJBClient.setInvocationTimeout(session, 10, TimeUnit.SECONDS);
      

      Client log:
      http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/31/console-perf17/

      NOTE: There is extra debugging enabled - each EJB client logs these two messages (in the beginning of the invocation and after it received a response):

      Invoking SFSB: getSerialAndIncrement()
      Response acquired, serial: X delay: Y
      

      You can see the ID number of the EJB client in the log message: [Runner - XXX]

      Timeline:

      perf18 is started again
      2017/04/12 03:17:28:705 EDT [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Bringing back node 0 (perf18)
      
      2017/04/12 03:17:33:646 all client requests start to queue up
      2017/04/12 03:17:33:618 EDT [DEBUG][Runner - 806] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 32 delay: 2
      2017/04/12 03:17:33:625 EDT [DEBUG][Runner - 330] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:627 EDT [DEBUG][Runner - 330] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 35 delay: 2
      2017/04/12 03:17:33:631 EDT [DEBUG][Runner - 489] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:635 EDT [DEBUG][Runner - 489] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 34 delay: 4
      2017/04/12 03:17:33:646 EDT [DEBUG][Runner - 175] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:648 EDT [DEBUG][Runner - 331] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:651 EDT [DEBUG][Runner - 488] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:651 EDT [DEBUG][Runner - 966] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:653 EDT [DEBUG][Runner - 807] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:654 EDT [DEBUG][Runner - 965] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:654 EDT [DEBUG][Runner - 648] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:662 EDT [DEBUG][Runner - 490] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:689 EDT [DEBUG][Runner - 808] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:690 EDT [DEBUG][Runner - 809] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:690 EDT [DEBUG][Runner - 177] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:690 EDT [DEBUG][Runner - 967] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:33:691 EDT [DEBUG][Runner - 649] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      
      at the same time, perf18 is far from starting JGroups channel...
      [JBossINF] [0m[0m03:17:33,605 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-4) WFLYUT0018: Host default-host starting
      [JBossINF] [0m[0m03:17:33,674 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-8) WFLYUT0006: Undertow AJP listener ajp listening on 10.16.90.54:8009
      [JBossINF] [0m[0m03:17:33,675 INFO  [org.wildfly.extension.undertow] (MSC service thread 1-6) WFLYUT0006: Undertow HTTP listener default listening on 10.16.90.54:8080
      [JBossINF] [0m[33m03:17:33,684 WARN  [org.jboss.as.remoting] (MSC service thread 1-6) ****** All authentication is ANONYMOUS for org.jboss.as.remoting.RemotingHttpUpgradeService
      [JBossINF] [0m[33m03:17:33,688 WARN  [org.jboss.as.remoting] (MSC service thread 1-2) ****** All authentication is ANONYMOUS for org.jboss.as.remoting.RemotingHttpUpgradeService
      [JBossINF] [0m[0m03:17:33,715 INFO  [org.jboss.modcluster] (ServerService Thread Pool -- 68) MODCLUSTER000001: Initializing mod_cluster version 1.3.6.Final-redhat-1
      [JBossINF] [0m[0m03:17:34,001 INFO  [org.jboss.modcluster] (ServerService Thread Pool -- 68) MODCLUSTER000032: Listening to proxy advertisements on /230.3.90.52:23364
      

      perf18 log: http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/31/console-perf18/

      Let's focus on [Runner 9] - one of the EJB clients which gets stuck.

      The last log message from [Runner 9]
      2017/04/12 03:17:37:156 EDT [DEBUG][Runner - 9] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      

      And finally, after perf18 starts, most of the clients get their responses

      perf18 started
      2017/04/12 03:17:37:600 EDT [DEBUG][Runner - 329] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:37:610 EDT [DEBUG][Runner - 647] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:37:611 EDT [DEBUG][Runner - 964] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:37:612 EDT [DEBUG][Runner - 174] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:37:618 EDT [DEBUG][Runner - 806] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:37:627 EDT [DEBUG][Runner - 330] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:37:636 EDT [DEBUG][Runner - 489] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Invoking SFSB: getSerialAndIncrement().
      2017/04/12 03:17:40:167 EDT [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Node 0 (perf18) is up.
      2017/04/12 03:17:40:167 EDT [INFO ][TestController] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Cluster state: [UP, UP, UP, UP].
      2017/04/12 03:17:40:200 EDT [DEBUG][Runner - 168] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 37 delay: 2732
      2017/04/12 03:17:40:201 EDT [DEBUG][Runner - 647] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 34 delay: 2591
      2017/04/12 03:17:40:201 EDT [DEBUG][Runner - 481] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 35 delay: 2761
      2017/04/12 03:17:40:202 EDT [DEBUG][Runner - 794] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 33 delay: 2872
      2017/04/12 03:17:40:202 EDT [DEBUG][Runner - 629] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 34 delay: 2968
      2017/04/12 03:17:40:202 EDT [DEBUG][Runner - 315] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 36 delay: 2941
      2017/04/12 03:17:40:203 EDT [DEBUG][Runner - 480] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 35 delay: 2778
      2017/04/12 03:17:40:203 EDT [DEBUG][Runner - 948] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Response acquired, serial: 32 delay: 2968
      

      In this particular run, there were 234 clients which got stuck. See failover report ( number of "active" sessions):
      http://jenkins.hosts.mwqe.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/31/artifact/report/failover.txt

      Summary:
      There are two issues:
      1. Client remote calls get queued up during server startup (they don't during server graceful shutdown)
      2. Some clients get stuck completely after server starts up.

      NOTE: In order to get load balancing working, we had to set Affinity to Affinity.NONE - See JBEAP-9355

              rachmato@redhat.com Richard Achmatowicz
              mvinkler1@redhat.com Michal Vinkler
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Created:
                Updated:
                Resolved: