Type: Bug
Resolution: Done
Priority: Blocker
Fix Version/s: 15.0.0.Beta1, 15.0.0.Final
Affects Version/s: None
Component/s: Clustering
Labels:
- blocker-WF14

Steps to Reproduce:
The issue is intermittent, so the reproducer does not fail on every run. There are two ways to reproduce it:

Option 1 - running locally on your laptop

git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
cd eap-tests-hornetq/scripts/
git checkout WFLY-10736
groovy -DEAP_ZIP_URL=<url-to-eap-or-wf-zip> PrepareServers7.groovy # for example file:///home/user/jboss-eap-7.x.x.zip or http://some-rul/wf-build.zip
export WORKSPACE=$PWD
export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
cd ../jboss-hornetq-testsuite/
mvn -Deap7.org.jboss.qa.hornetq.apps.clients.version=7.1531909463-SNAPSHOT clean install -B -Dtest=Lodh2TestCase#testRemoteJcaInboundOnly | tee log
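Because the failure is intermittent, a single run of the test may pass. The following is a minimal sketch, assuming a bash-compatible shell, of repeating a command until it first fails. The helper name `run_until_failure` and the attempt limit are illustrative assumptions and are not part of the test suite.

```shell
# Hypothetical helper (not part of eap-tests-hornetq): run a command up to
# MAX times and stop at the first non-zero exit status.
# Usage: run_until_failure MAX COMMAND [ARGS...]
run_until_failure() {
  local max="$1"
  shift
  local i=1
  while [ "$i" -le "$max" ]; do
    if ! "$@"; then
      echo "failed on attempt $i"
      return 1
    fi
    i=$((i + 1))
  done
  echo "no failure in $max attempts"
}
```

For example, `run_until_failure 20 mvn clean install -B -Dtest=Lodh2TestCase#testRemoteJcaInboundOnly` (with the same `-D` options as above) would stop at the first failing run so that the server logs can be collected.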

Option 2 - running automatic Jenkins job

You can also use the following Jenkins job to run the reproducer against your custom WF/EAP build:
https://eap-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/eap-7.x-messaging-retry-one-test-trigger-face/

Note that you need to upload your EAP/WF zip to Dropbox or another public location that the job can access. Then use the same parameters as in build [1], where the issue was reproduced with WildFly 14. In the job parameters, replace the value of EAP_ZIP_URL with the URL of your custom WF/EAP zip, for example on Dropbox.

[1] https://eap-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/eap-7.x-messaging-retry-one-test-trigger-face/183/parameters/
Workaround Description:

To improve test reliability, add a delay between server kill and restart.
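The workaround can be sketched as follows, assuming a bash-compatible shell. The helper name `kill_and_restart` is hypothetical, and the issue does not state how long the delay should be, so the delay is left as a parameter.

```shell
# Hypothetical helper: kill a server process, wait for a delay, then run the
# start command. The delay length is an assumption; the issue gives no value.
# Usage: kill_and_restart PID DELAY_SECONDS START_COMMAND [ARGS...]
kill_and_restart() {
  local pid="$1"
  local delay="$2"
  shift 2
  # Simulate a crash of the server process.
  kill -9 "$pid" 2>/dev/null || true
  # Reap the process if it is a child of this shell; otherwise this is a no-op.
  wait "$pid" 2>/dev/null || true
  # Give the remaining cluster members time to notice the departure
  # before the same node joins again.
  sleep "$delay"
  "$@"
}
```

A call such as `kill_and_restart "$NODE1_PID" 10 "$JBOSS_HOME_1/bin/standalone.sh" -c standalone-full-ha.xml` would apply it to node 1; `NODE1_PID` and the 10-second delay are placeholders.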


One of our tests for fault tolerance of JMS bridges shows a regression in JGroups or Infinispan. The JMS bridge itself appears to be unrelated. The issue was hit in the WF weekly run.

Test Scenario:

There are two servers. InQueue is deployed on node 1 and OutQueue is deployed on node 2. Both servers are started.
Large byte messages are sent to InQueue on node 1. A bridge between the servers/queues transfers messages from node 1 to node 2.
Node 1 is killed and started again.
All messages are received from OutQueue on node 2.

Result:
Node 1 does not start after the kill and hangs. The following exception is logged on node 2:

:26:17,894 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100000: Node node-1 joined the cluster
09:26:18,520 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
09:26:18,521 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
09:26:18,521 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
09:26:18,522 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
09:26:18,522 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
09:26:18,522 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
09:26:18,522 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN000094: Received new cluster view for channel ejb: [node-2|7] (2) [node-2, node-1]
09:26:18,523 INFO  [org.infinispan.CLUSTER] (thread-12,ejb,node-2) ISPN100001: Node node-1 left the cluster
09:26:18,868 INFO  [org.infinispan.CLUSTER] (remote-thread--p5-t2) ISPN000310: Starting cluster-wide rebalance for cache default, topology CacheTopology{id=17, phase=READ_OLD_WRITE_ALL, rebalanceId=6, currentCH=ReplicatedConsistentHash{ns = 256, owners = (2)[node-2: 122, node-1: 134]}, pendingCH=ReplicatedConsistentHash{ns = 256, owners = (3)[node-2: 84, node-1: 90, node-1: 82]}, unionCH=null, actualMembers=[node-2, node-1, node-1], persistentUUIDs=[12443bfb-e88a-46f3-919e-9213bf38ce19, 2873237f-d881-463f-8a5a-940bf1d764e5, a05ea8af-a83b-42a9-b937-dc2da1cae6d1]}
09:26:18,869 INFO  [org.infinispan.CLUSTER] (remote-thread--p5-t2) [Context=default][Scope=node-2]ISPN100002: Started rebalance with topology id 17
09:26:18,870 INFO  [org.infinispan.CLUSTER] (transport-thread--p14-t5) [Context=default][Scope=node-2]ISPN100003: Node node-2 finished rebalance phase with topology id 17
09:26:18,981 INFO  [org.infinispan.CLUSTER] (remote-thread--p5-t2) [Context=default][Scope=node-1]ISPN100003: Node node-1 finished rebalance phase with topology id 17
09:27:18,530 WARN  [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p15-t4) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
	at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
	Suppressed: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
		at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) [rt.jar:1.8.0_131]
		at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) [rt.jar:1.8.0_131]
		at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
		at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:71)
		at org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:540)
		at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:523)
		at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:334)
		at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:85)
		at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:745)
		at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
		at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
		at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
		at org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
		... 1 more
	Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
		at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
		... 1 more
	[CIRCULAR REFERENCE:java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1]

09:27:18,530 WARN  [org.infinispan.topology.ClusterTopologyManagerImpl] (transport-thread--p16-t4) ISPN000197: Error updating cluster member list: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
	at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
	at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [rt.jar:1.8.0_131]
	Suppressed: java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
		at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) [rt.jar:1.8.0_131]
		at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915) [rt.jar:1.8.0_131]
		at org.infinispan.util.concurrent.CompletableFutures.await(CompletableFutures.java:82)
		at org.infinispan.remoting.transport.Transport.invokeRemotely(Transport.java:71)
		at org.infinispan.topology.ClusterTopologyManagerImpl.confirmMembersAvailable(ClusterTopologyManagerImpl.java:540)
		at org.infinispan.topology.ClusterTopologyManagerImpl.updateCacheMembers(ClusterTopologyManagerImpl.java:523)
		at org.infinispan.topology.ClusterTopologyManagerImpl.handleClusterView(ClusterTopologyManagerImpl.java:334)
		at org.infinispan.topology.ClusterTopologyManagerImpl.access$500(ClusterTopologyManagerImpl.java:85)
		at org.infinispan.topology.ClusterTopologyManagerImpl$ClusterViewListener.lambda$handleViewChange$0(ClusterTopologyManagerImpl.java:745)
		at org.infinispan.executors.LimitedExecutor.runTasks(LimitedExecutor.java:144)
		at org.infinispan.executors.LimitedExecutor.access$100(LimitedExecutor.java:33)
		at org.infinispan.executors.LimitedExecutor$Runner.run(LimitedExecutor.java:174)
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
		at org.wildfly.clustering.service.concurrent.ClassLoaderThreadFactory.lambda$newThread$0(ClassLoaderThreadFactory.java:47)
		... 1 more
	Caused by: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1
		at org.infinispan.remoting.transport.impl.MultiTargetRequest.onTimeout(MultiTargetRequest.java:167)
		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:87)
		at org.infinispan.remoting.transport.AbstractRequest.call(AbstractRequest.java:22)
		at java.util.concurrent.FutureTask.run(FutureTask.java:266) [rt.jar:1.8.0_131]
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_131]
		at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [rt.jar:1.8.0_131]
		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_131]
		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_131]
		... 1 more
	[CIRCULAR REFERENCE:java.util.concurrent.ExecutionException: org.infinispan.util.concurrent.TimeoutException: ISPN000476: Timed out waiting for responses for request 4 from node-1]
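
In the ISPN000310 line above, actualMembers lists node-1 twice and pendingCH has three owners for a two-node cluster. The following is a minimal sketch, assuming a bash-compatible shell with standard grep, sed, sort and uniq, for spotting such duplicate member names in a server log. The helper name `duplicate_members` is hypothetical.

```shell
# Hypothetical helper: read a server log on stdin and print every member name
# that occurs more than once within a single actualMembers=[...] list.
duplicate_members() {
  grep -o 'actualMembers=\[[^]]*]' \
    | sed -e 's/^actualMembers=\[//' -e 's/]$//' \
    | while IFS= read -r members; do
        # One name per line, strip leading spaces, keep only repeated names.
        printf '%s\n' "$members" | tr ',' '\n' | sed 's/^ *//' | sort | uniq -d
      done
}
```

Running `duplicate_members < server.log` against the node 2 log would be one way to check whether a run hit the same symptom; `server.log` is a placeholder path.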

Infinispan uses the default JGroups UDP stack. Both servers (JGroups UDP) are bound to 127.0.0.1. Node 2 has a port offset of 1000.

Attached is a thread dump from node 1 taken while it hangs during startup.

Attachments:

clusterKilTest.zip (1.26 MB, 2018/07/27 2:32 AM)
Lodh2TestCase.testRemoteJcaInboundOnly.zip (3.11 MB, 2018/08/14 9:29 AM)
Lodh2TestCase.testRemoteJcaInboundOnly2.zip (500 kB, 2018/08/28 4:44 AM)
Lodh2TestCase.testRemoteJcaInboundOnly-traces.zip (4.78 MB, 2018/08/15 9:22 AM)
logs-traces.zip (2.71 MB, 2018/08/28 7:19 AM)
logs-traces2.zip (4.79 MB, 2018/08/29 3:01 AM)
logs-traces3.zip (4.87 MB, 2018/09/06 5:05 AM)
logs-traces4.zip (20.64 MB, 2018/09/11 3:25 AM)
logs-with-workaround.zip (510 kB, 2018/07/30 8:39 AM)
node-1-thread-dump-before-kill-shutdown-sequence.txt (94 kB, 2018/07/25 3:37 AM)
server-with-mdb.log (299 kB, 2018/08/13 3:03 AM)
standalone-full-ha-1.xml (36 kB, 2018/07/25 3:40 AM)
standalone-full-ha-2.xml (35 kB, 2018/07/25 3:40 AM)

Issue Links:

is blocked by
WFLY-10956 Upgrade Infinispan to 9.3.3.Final (Closed)
WFLY-11018 Upgrade JGroups to 4.0.15.Final (Closed)
WFLY-11020 Upgrade Infinispan to 9.3.4.Final (Closed)

is caused by
ISPN-9465 Initial state transfer timeout for joining node following previous kill if cache does not exist on the coordinator (Closed)
ISPN-9517 State transfer times out if initiated with yet to be verified suspected member and reincarnated member (Closed)
JGRP-2286 VERIFY_SUSPECT can return false positives if member is killed and immediately restarted (Closed)
WFLY-11056 Singleton MDB barrier service starts prematurely (Closed)

is cloned by
JBEAP-15404 Server in cluster hangs during start after previous kill (Closed)
WFLY-10787 Eagerly started clustering services can fail if server starts during a period of membership instability (Closed)

is incorporated by
WFLY-10839 RetryingInvoker fails with millisecond timeouts (IllegalArgumentException: nanosecond timeout value out of range) (Closed)
WFLY-10840 CacheServiceProviderRegistry#INVOKER interval is too large (500 seconds) (Closed)
