Uploaded image for project: 'WildFly'
  1. WildFly
  2. WFLY-6926

TimeoutException: Replication timeout when handling request with DIST cache

XMLWordPrintable

      Seen in http-session scenarios with DIST cache only. REPL cache scenarios seem not to be affected.

      TimeoutExceptions occur right away after sending first request and definitely result in client getting 500:

      [JBossINF] [0m[31m04:31:39,033 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (default task-1) ISPN000136: Error executing command GetKeyValueCommand, writing keys []: org.infinispan.util.concurrent.TimeoutException: Replication timeout for perf19
      [JBossINF] 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:801)
      [JBossINF] 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$1(JGroupsTransport.java:642)
      [JBossINF] 	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
      [JBossINF] 	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
      [JBossINF] 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
      [JBossINF] 	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
      [JBossINF] 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.staggeredProcessNext(CommandAwareRpcDispatcher.java:375)
      [JBossINF] 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.lambda$processCallsStaggered$3(CommandAwareRpcDispatcher.java:357)
      [JBossINF] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      [JBossINF] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      [JBossINF] 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
      [JBossINF] 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
      [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      [JBossINF] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      [JBossINF] 	at java.lang.Thread.run(Thread.java:745)
      

      Client:

      2016/08/08 04:31:39:068 EDT [WARN ][Runner - 4] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data:  Internal Server Error>
              org.jboss.smartfrog.loaddriver.RequestProcessingException: Invalid response code: 500 Content: Internal Server Error
      	at org.jboss.smartfrog.loaddriver.http.HttpRequestProcessorFactoryImpl$HttpRequestProcessor.processRequest(HttpRequestProcessorFactoryImpl.java:163)
      	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
      	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:103)
      	at java.lang.Thread.run(Thread.java:745)
      

      I was able to reproduce it even with a very small load of 30 sessions.

      Server link (run with 30 sessions):
      http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-http-session-jvmkill-dist-sync/27/console-perf21/consoleText

      Client link (run with 30 sessions):
      http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-http-session-jvmkill-dist-sync/27/console-perf17/consoleText

      This issue occurs under different conditions than JBEAP-3779 (which is similar), that's why I opened new issue.
      Number of occurrences of TimeoutExceptions is very high. The server logs are full of them.
      This issue makes clustering with DIST cache unusable, thus giving blocker priority.

      Also, there is a issue with shutting down the servers in the end of the test, I don't know if it can be related, but I am hitting it constantly in the http-session DIST scenarios.
      The server shutdown usually gets stuck on this command:

      [JBossINF] [0m[0m04:38:51,694 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-7) ISPN000082: Stopping the RpcDispatcher for channel web
      

      EDIT:
      Also ejb-remote tests with DIST cache are affected (even if only 20 sessions are created).

      Server log for ejb-remote run:
      http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/3/console-perf20/

      Client log for ejb-remote run:
      http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/3/console-perf17/

      EDIT #2:
      Can be very easily reproduced locally. Just start a cluster of at least 3 servers with clusterbench deployed. Then do few requests to various nodes until you get 500 as a response. Also notice, when shutting down a node, which logged TimeoutException before, the node won't stop and get stuck during the shutdown.

              pferraro@redhat.com Paul Ferraro
              mvinkler1@redhat.com Michal Vinkler
              Votes:
              1 Vote for this issue
              Watchers:
              8 Start watching this issue

                Created:
                Updated:
                Resolved: