Uploaded image for project: 'WildFly'
  1. WildFly
  2. WFLY-6926

TimeoutException: Replication timeout when handling request with DIST cache

    XMLWordPrintable

Details

    Description

      Seen in http-session scenarios with DIST cache only. REPL cache scenarios seem not to be affected.

      TimeoutExceptions occur right away after sending first request and definitely result in client getting 500:

      [JBossINF] [0m[31m04:31:39,033 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (default task-1) ISPN000136: Error executing command GetKeyValueCommand, writing keys []: org.infinispan.util.concurrent.TimeoutException: Replication timeout for perf19
      [JBossINF] 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:801)
      [JBossINF] 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$1(JGroupsTransport.java:642)
      [JBossINF] 	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
      [JBossINF] 	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
      [JBossINF] 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
      [JBossINF] 	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
      [JBossINF] 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.staggeredProcessNext(CommandAwareRpcDispatcher.java:375)
      [JBossINF] 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.lambda$processCallsStaggered$3(CommandAwareRpcDispatcher.java:357)
      [JBossINF] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      [JBossINF] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      [JBossINF] 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
      [JBossINF] 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
      [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      [JBossINF] 	at java.lang.Thread.run(Thread.java:745)
      

      Client:

      2016/08/08 04:31:39:068 EDT [WARN ][Runner - 4] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data:  Internal Server Error>
              org.jboss.smartfrog.loaddriver.RequestProcessingException: Invalid response code: 500 Content: Internal Server Error
      	at org.jboss.smartfrog.loaddriver.http.HttpRequestProcessorFactoryImpl$HttpRequestProcessor.processRequest(HttpRequestProcessorFactoryImpl.java:163)
      	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
      	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:103)
      	at java.lang.Thread.run(Thread.java:745)
      

      I was able to reproduce it even with a very small load of 30 sessions.

      Server link (run with 30 sessions):
      http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-http-session-jvmkill-dist-sync/27/console-perf21/consoleText

      Client link (run with 30 sessions):
      http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-http-session-jvmkill-dist-sync/27/console-perf17/consoleText

      This issue occurs under different conditions than JBEAP-3779 (which is similar), that's why I opened new issue.
      Number of occurrences of TimeoutExceptions is very high. The server logs are full of them.
      This issue makes clustering with DIST cache unusable, thus giving blocker priority.

      Also, there is a issue with shutting down the servers in the end of the test, I don't know if it can be related, but I am hitting it constantly in the http-session DIST scenarios.
      The server shutdown usually gets stuck on this command:

      [JBossINF] [0m[0m04:38:51,694 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-7) ISPN000082: Stopping the RpcDispatcher for channel web
      

      EDIT:
      Also ejb-remote tests with DIST cache are affected (even if only 20 sessions are created).

      Server log for ejb-remote run:
      http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/3/console-perf20/

      Client log for ejb-remote run:
      http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/3/console-perf17/

      EDIT #2:
      Can be very easily reproduced locally. Just start a cluster of at least 3 servers with clusterbench deployed. Then do few requests to various nodes until you get 500 as a response. Also notice, when shutting down a node, which logged TimeoutException before, the node won't stop and get stuck during the shutdown.

      Attachments

        Issue Links

          Activity

            Public project attachment banner

              context keys: [headless, issue, helper, isAsynchronousRequest, project, action, user]
              current Project key: WFLY

              People

                pferraro@redhat.com Paul Ferraro
                mvinkler1@redhat.com Michal Vinkler
                Votes:
                1 Vote for this issue
                Watchers:
                8 Start watching this issue

                Dates

                  Created:
                  Updated:
                  Resolved: