Loading...

Type: Bug
Resolution: Done
Priority: Blocker
Fix Version/s: 10.1.0.Final, 11.0.0.Alpha1
Affects Version/s: 10.1.0.CR1
Component/s: Clustering
Labels:
None

Git Pull Request:
https://github.com/wildfly/wildfly/pull/9125, https://github.com/wildfly/wildfly/pull/9124

Seen in http-session scenarios with DIST cache only. REPL cache scenarios seem not to be affected.

TimeoutExceptions occur right away after sending first request and definitely result in client getting 500:

[JBossINF] [0m[31m04:31:39,033 ERROR [org.infinispan.interceptors.InvocationContextInterceptor] (default task-1) ISPN000136: Error executing command GetKeyValueCommand, writing keys []: org.infinispan.util.concurrent.TimeoutException: Replication timeout for perf19
[JBossINF] 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:801)
[JBossINF] 	at org.infinispan.remoting.transport.jgroups.JGroupsTransport.lambda$invokeRemotelyAsync$1(JGroupsTransport.java:642)
[JBossINF] 	at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602)
[JBossINF] 	at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
[JBossINF] 	at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
[JBossINF] 	at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962)
[JBossINF] 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.staggeredProcessNext(CommandAwareRpcDispatcher.java:375)
[JBossINF] 	at org.infinispan.remoting.transport.jgroups.CommandAwareRpcDispatcher.lambda$processCallsStaggered$3(CommandAwareRpcDispatcher.java:357)
[JBossINF] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[JBossINF] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[JBossINF] 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
[JBossINF] 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
[JBossINF] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[JBossINF] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[JBossINF] 	at java.lang.Thread.run(Thread.java:745)

Client:

2016/08/08 04:31:39:068 EDT [WARN ][Runner - 4] HOST perf17.mw.lab.eng.bos.redhat.com:rootProcess:c - Error sampling data:  Internal Server Error>
        org.jboss.smartfrog.loaddriver.RequestProcessingException: Invalid response code: 500 Content: Internal Server Error
	at org.jboss.smartfrog.loaddriver.http.HttpRequestProcessorFactoryImpl$HttpRequestProcessor.processRequest(HttpRequestProcessorFactoryImpl.java:163)
	at org.jboss.smartfrog.loaddriver.CompoundRequestProcessorFactoryImpl$CompoundRequestProcessor.processRequest(CompoundRequestProcessorFactoryImpl.java:52)
	at org.jboss.smartfrog.loaddriver.Runner.run(Runner.java:103)
	at java.lang.Thread.run(Thread.java:745)

I was able to reproduce it even with a very small load of 30 sessions.

Server link (run with 30 sessions):
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-http-session-jvmkill-dist-sync/27/console-perf21/consoleText

Client link (run with 30 sessions):
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-http-session-jvmkill-dist-sync/27/console-perf17/consoleText

This issue occurs under different conditions than ~~JBEAP-3779~~ (which is similar), that's why I opened new issue.
Number of occurrences of TimeoutExceptions is very high. The server logs are full of them.
This issue makes clustering with DIST cache unusable, thus giving blocker priority.

Also, there is a issue with shutting down the servers in the end of the test, I don't know if it can be related, but I am hitting it constantly in the http-session DIST scenarios.
The server shutdown usually gets stuck on this command:

[JBossINF] [0m[0m04:38:51,694 INFO  [org.infinispan.remoting.transport.jgroups.JGroupsTransport] (MSC service thread 1-7) ISPN000082: Stopping the RpcDispatcher for channel web

EDIT:
Also ejb-remote tests with DIST cache are affected (even if only 20 sessions are created).

Server log for ejb-remote run:
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/3/console-perf20/

Client log for ejb-remote run:
http://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/mvinkler_eap-7x-failover-ejb-ejbremote-shutdown-dist-async/3/console-perf17/

EDIT #2:
Can be very easily reproduced locally. Just start a cluster of at least 3 servers with clusterbench deployed. Then do few requests to various nodes until you get 500 as a response. Also notice, when shutting down a node, which logged TimeoutException before, the node won't stop and get stuck during the shutdown.

clones

JBEAP-5573 TimeoutException: Replication timeout when handling request with DIST cache

Verified

is blocked by

JBEAP-5614 Upgrade Infinispan to 8.2.4.Final

Verified

WFLY-6928 Upgrade Infinispan to 8.2.4.Final

Closed

is caused by

ISPN-6925 Race condition in staggered gets

Closed

ISPN-6745 Locks are lost in pessimistic cache

Closed

Details

Description

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates