Type: Bug
Resolution: Done
Priority: Blocker
Fix Version: 5.2.0.Final
Labels: None
Tomas noticed this a while ago in a specific functional test:
https://bugzilla.redhat.com/show_bug.cgi?id=875151
I'm creating a more general JIRA, because I'm hitting this in a resilience test.
What I found with a quick debug session is that here:
for (segmentIdx <- 0 until numSegments) {
   val denormalizedSegmentHashIds = allDenormalizedHashIds(segmentIdx)
   val segmentOwners = ch.locateOwnersForSegment(segmentIdx)
   for (ownerIdx <- 0 until segmentOwners.length) {
      val address = segmentOwners(ownerIdx % segmentOwners.size)
      val serverAddress = members(address)
      val hashId = denormalizedSegmentHashIds(ownerIdx)
      log.tracef("Writing hash id %d for %s:%s", hashId, serverAddress.host, serverAddress.port)
      writeString(serverAddress.host, buf)
      writeUnsignedShort(serverAddress.port, buf)
      buf.writeInt(hashId)
   }
}
we're trying to obtain a serverAddress for a nonexistent address, and the resulting NoSuchElementException is not handled properly.
It happens after I kill a node in the resilience test; the exception appears when querying for the killed node in the members cache.
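For illustration only, a minimal sketch (not the actual fix) of how the lookup in the snippet above could be guarded; it reuses the names from that loop and assumes members is a Map[Address, ServerAddress]:

// Hypothetical guard: skip an owner that has already dropped out of the
// members cache instead of letting members(address) throw
// NoSuchElementException.
members.get(address) match {
   case Some(serverAddress) =>
      log.tracef("Writing hash id %d for %s:%s", hashId, serverAddress.host, serverAddress.port)
      writeString(serverAddress.host, buf)
      writeUnsignedShort(serverAddress.port, buf)
      buf.writeInt(hashId)
   case None =>
      log.tracef("Owner %s is not in the members cache, skipping", address)
}

Whether skipping is actually the right behaviour is a separate question (the client would then see fewer hash ids for that segment), so treat this purely as a sketch of the failure point.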
[ISPN-2550] NoSuchElementException in Hot Rod Encoder
Michal Linhard <mlinhard@redhat.com> changed the Status of bug 886565 from ON_QA to VERIFIED
Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 886565 from MODIFIED to ON_QA
Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 886565 from ASSIGNED to MODIFIED
Sorry Michal, I didn't refresh the JIRA page before posting my comment.
I'm glad the fix works, I'll try to get a unit test working as well before issuing a PR though.
I've reduced the number of entries in the cache during the test to 5000 1kB entries and got a clean resilience test run:
http://www.qa.jboss.com/~mlinhard/hyperion3/run0013/report/stats-throughput.png
only expected exceptions:
http://www.qa.jboss.com/~mlinhard/hyperion3/run0013/report/loganalysis/server/
There is still a problem with uneven request balancing (ISPN-2632) and with the whole system blocking after a join when there's more data (5% of heap filled), but it doesn't have to be related to the issues we're discussing here.
Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 886565 from NEW to ASSIGNED
As I said, ISPN-2642 didn't appear, so it seems to be fixed. I'm now investigating other problems I have with that test run.
@Galder, could we modify the JIRA subject to say this one happens during leave and the other happens during join then?
@Michal, commit https://github.com/danberindei/infinispan/commit/754b9de995221075e14bba7fa459e597bdb16287 should fix ISPN-2642 as well, have you tested it?
I've patched JDG 6.1.0.ER5 by replacing the infinispan-core and infinispan-server-hotrod jars with ones built from Dan's branch
and ran resilience tests in hyperion:
http://www.qa.jboss.com/~mlinhard/hyperion3/run0011/report/stats-throughput.png
The issues ISPN-2550 and ISPN-2642 didn't appear, but the run still wasn't OK. After the rejoin of the killed node0002, all operations were blocked for more than 5 minutes - i.e. zero throughput in the last stage of the test. I'm investigating what happened there.
The IndexOutOfBoundsException appears independently of Dan's fix, so I created ISPN-2642.
Plus, if there really is an issue when a node joins (as opposed to being killed), your fix won't work and would result in imbalances in the cluster... but let's not make judgements, let's see what ISPN-2624 is about and then we talk...
@Dan, ISPN-2624 is a different scenario. It happens when a node starts up and one of the nodes is apparently set up for storage only (no Netty endpoint). To avoid confusion, I'm treating it as a different case right now, because it smells like a misconfiguration. Michal's case is about killing nodes.
What is the issue in ISPN-2624? The subject looks the same to me.
Michal, yes, the commit https://github.com/danberindei/infinispan/commit/754b9de995221075e14bba7fa459e597bdb16287 was intended to fix the IndexOutOfBoundsException.
Tomas' functional issue has now been separated into ISPN-2624, leaving this JIRA fully focused on the situation when the nodes are killed.
Tomas, it seems the config you provided works fine as storage-only.
Can you create a separate issue to track yours? I don't want to mix it with the node-kill issue.
Also, if you can replicate the issue again, can you provide JDG version information, TRACE logs, etc.? Can you try to replicate the issue on JDG master too?
The IndexOutOfBoundsException was found when running with https://github.com/danberindei/infinispan/commit/c3325b134704016fa556343529d6a3a5b9a96bcb
BTW, now I can see another commit on the t_2550_m branch - would it still be helpful to test with it?
Michal, what is the last commit you had when you ran the test?
Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 875151 from NEW to ASSIGNED
900MB of tasty tracelogs from runs with 5.2.0.Beta5 (resilience tests on hudson / perflab)
http://www.qa.jboss.com/~mlinhard/test_results/serverlogs-trace-ispn2550.zip
njoy!
Dan, I wanted to try your change, but I don't see any further commit on the branch https://github.com/danberindei/infinispan/tree/t_2550_m
The IndexOutOfBoundsException seems to appear because we're generating numOwners (2) "denormalized" hash ids for each segment, but the consistent hash has more owners than that for one segment (3). This can happen during a join, when the write CH is a union between the previous CH and the new, balanced CH.
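To make the mismatch concrete, a self-contained sketch (illustrative only - the numbers and the ArrayBuffer stand in for the encoder's real state, they are not taken from the actual code):

import scala.collection.mutable.ArrayBuffer

object DenormalizedHashIdMismatch extends App {
   // Only numOwners hash ids are denormalized per segment...
   val numOwners = 2
   val denormalizedSegmentHashIds = ArrayBuffer(1000, 2000)

   // ...but during a join the union (write) CH can list an extra owner.
   val segmentOwners = List("nodeA", "nodeB", "nodeC")

   for (ownerIdx <- 0 until segmentOwners.length) {
      // ownerIdx == 2 is out of bounds for the two-element buffer and throws
      // java.lang.IndexOutOfBoundsException: 2, the same failure reported in the stack trace.
      val hashId = denormalizedSegmentHashIds(ownerIdx)
      println("owner=" + segmentOwners(ownerIdx) + " hashId=" + hashId)
   }
}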
Tomas, I've updated my branch to use the read CH instead, could you try again?
Tristan Tarrant <ttarrant@redhat.com> made a comment on bug 875151
Yes, RCMs get the server list dynamically from the servers. However only the servers with an endpoint should add their address to the list.
Martin Gencur <mgencur@redhat.com> made a comment on bug 875151
Just a note about the test: when we create a RemoteCacheManager and pass just one address to it, it does not mean that all requests through cache.put/get will go just to this one address - they can possibly go to all nodes in the cluster. Is that right? AFAIK the Hot Rod client dynamically gets the information about all clustered nodes and autonomously chooses one of the cluster nodes to send requests to. If my assumption is correct, we would need to use a Memcached or REST client to properly test the storage-only example, not Hot Rod.
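If I understand that correctly, a minimal sketch of the behaviour (assuming the Hot Rod Java client API of that era; host name and port are made up):

import org.infinispan.client.hotrod.RemoteCacheManager

object SingleAddressBootstrap extends App {
   // Bootstrapped with a single server address only...
   val rcm = new RemoteCacheManager("node1.example.com", 11222)
   val cache = rcm.getCache[String, String]()

   // ...but once the first response delivers the cluster topology, this put
   // may be routed to any node in the view - which is why a storage-only
   // node without a Hot Rod endpoint must not end up in that topology.
   cache.put("k", "v")

   rcm.stop()
}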
Tomas Sykora <tsykora@redhat.com> made a comment on bug 875151
Hey dberinde@redhat.com, can you check that IndexOutOfBoundsException issue? Let's see if Michal can upload TRACE logs.
NadirX, Tomas' issue appears to show a storage-only node (which shouldn't have any endpoints, log ending in 49...) responding to a client request, so the endpoint is somehow active. Can you check the JDG configuration he's using to see if there are any issues there?
Tomas Sykora <tsykora@redhat.com> made a comment on bug 875151
I've run tests locally with Dan's fix and I'm seeing these exceptions:
11:19:23,919 ERROR [org.infinispan.server.hotrod.HotRodDecoder] (HotRodClientMaster-5) ISPN005009: Unexpected error before any request parameters read
java.lang.IndexOutOfBoundsException: 2
	at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:44)
	at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
	at org.infinispan.server.hotrod.AbstractTopologyAwareEncoder1x$$anonfun$writeHashTopologyHeader$1$$anonfun$apply$mcVI$sp$1.apply(AbstractTopologyAwareEncoder1x.scala:96)
	at org.infinispan.server.hotrod.AbstractTopologyAwareEncoder1x$$anonfun$writeHashTopologyHeader$1$$anonfun$apply$mcVI$sp$1.apply(AbstractTopologyAwareEncoder1x.scala:92)
	at scala.collection.immutable.Range.foreach(Range.scala:81)
	at org.infinispan.server.hotrod.AbstractTopologyAwareEncoder1x$$anonfun$writeHashTopologyHeader$1.apply$mcVI$sp(AbstractTopologyAwareEncoder1x.scala:92)
	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:78)
	at org.infinispan.server.hotrod.AbstractTopologyAwareEncoder1x.writeHashTopologyHeader(AbstractTopologyAwareEncoder1x.scala:89)
	at org.infinispan.server.hotrod.AbstractEncoder1x.writeHeader(AbstractEncoder1x.scala:62)
	at org.infinispan.server.hotrod.HotRodEncoder.encode(HotRodEncoder.scala:63)
	at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:67)
	at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:60)
	at org.jboss.netty.channel.Channels.write(Channels.java:712)
	at org.jboss.netty.channel.Channels.write(Channels.java:679)
	at org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:248)
	at org.infinispan.server.core.AbstractProtocolDecoder.exceptionCaught(AbstractProtocolDecoder.scala:295)
	at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:533)
	at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:49)
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
	at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:84)
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.processSelectedKeys(AbstractNioWorker.java:472)
	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:333)
	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:35)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
Tomas Sykora <tsykora@redhat.com> made a comment on bug 875151
I attached surefire report from our test suite.
Galder, please, see test: trunk/remote/config-examples/standalone-storage-only/src/test/java/com.jboss.datagrid.test.examples.StorageOnlyConfigExampleTest.java
It is failing on line 73: rc1.put("k", "v");
This put caused the attached stack trace.
We are starting one JDG server with standalone-ha.xml and a second one with standalone-storage-only.xml, which you can find in the JDG server's docs/examples/configs.
Tomas' tracelog shows exactly the same spot as my scenario: https://bugzilla.redhat.com/attachment.cgi?id=641649 (I'm not sure about his test scenario though)
Tomas, I was wondering which of the functional tests you had developed was failing, and where (stack trace of the failure, etc.). The idea is to replicate that specific test in the Infinispan codebase. Thanks.
Tomas Sykora <tsykora@redhat.com> made a comment on bug 875151
Hi Galder,
I experience this problem in our functional test suite for remote mode (server).
preNOTE: you probably don't need to install the Arquillian project, as its CR1 is published already.
preNOTE: you need to create an empty directory named "bundles" in edg0/, edg1/, etc.
Please see this doc: https://docspace.corp.redhat.com/docs/DOC-87715
Download our tests from SVN and run this specific test (for the storage-only example).
Just cd to edgTest/trunk/remote and run:
mvn -s ~/programs/eclipseWorkspace/settings_mead_jdg_plus_local.xml clean verify -Dstack=udp -pl config-examples/standalone-storage-only -Dnode0.edghome=/home/tsykora/edg0 -Dnode1.edghome=/home/tsykora/edg1 -Dnode2.edghome=/home/tsykora/edg2 -Dmaven.test.failure.ignore=true
NOTE: this user-specific Maven settings file (-s) points to my "local" repo, which comes with the regular ER builds. You can ignore it and simply run with these settings using the MEAD repo:
https://svn.devel.redhat.com/repos/jboss-qa/jdg/scripts/settings_mead_jdg.xml
You can obtain the latest JDG server from here: http://download.lab.bos.redhat.com/devel/jdg/stage/JDG-6.1.0-ER5/
I hope I didn't forget anything. In case of any problem, anything, let me know.
Right, that's true. I've just spoken with Tomas; he's going to supply a way to test this in his scenario.
I'll try to test Dan's fix as well.
Michal, in the beginning you mentioned something about Tomas finding this in a functional test; that's the test I'm looking for.
Also, if you can replicate the issue easily, can you try Dan's fix to see if it works?
And one more important thing: during the whole test, a constant small load from multiple Hot Rod clients is applied. I think I had to have at least 10 clients locally for the bug to appear. It seems to happen when they're receiving the new topology and it fails for some of them...
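A rough sketch of that kind of load (this is not the SmartFrog driver itself, just an assumption of what "constant small load from ~10 clients" could look like; host name, port and key names are placeholders):

import java.util.concurrent.Executors
import org.infinispan.client.hotrod.RemoteCacheManager

object SmallConstantLoad extends App {
   val pool = Executors.newFixedThreadPool(10)
   for (clientId <- 0 until 10) {
      pool.submit(new Runnable {
         def run(): Unit = {
            // Each client has its own RemoteCacheManager, so each one receives
            // topology updates independently when a node is killed or rejoins.
            val rcm = new RemoteCacheManager("node1.example.com", 11222)
            val cache = rcm.getCache[String, String]()
            try {
               while (!Thread.currentThread().isInterrupted()) {
                  cache.put("key-" + clientId, "value")
                  Thread.sleep(100) // keep the load small but constant
               }
            } finally {
               rcm.stop()
            }
         }
      })
   }
}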
I found this using a resilience test implemented in the distributed SmartFrog framework that we run in our perflab; I don't have it in a simple test method.
What it does is this:
1. start 4 nodes
2. let them run 5 min
3. kill node2
4. wait for a cluster of node1, node3, node4
5. wait 5 min
6. start node2
7. wait for a cluster of node1 - node4
8. wait 5 min
The exception happens in step 3, right after killing node2.
I also managed to reproduce this locally by running 4 nodes on my laptop - that's how I debugged it.
Michal, can you share the test so that we can map it to an Infinispan unit test and verify Dan's fix?
Dan, did you check the functional test Michal's referring to? You might be able to create a test out of that. I'm assigning this to you since you're more familiar with these changes.
Galder, I think I have a fix for this issue: https://github.com/danberindei/infinispan/commit/3712ffac1ec1503f17b3f9de022bfc98a20b90e1
The problem is that I don't have a test to go with it, so I'm not sure if it really works. I'm not issuing a PR yet, but I'm leaving it here for reference.
Michal Linhard <mlinhard@redhat.com> made a comment on bug 875151
I'm seeing this in resilience tests for 6.1.0.ER4.
I've created a more general JIRA for this.
Michal Linhard <mlinhard@redhat.com> made a comment on bug 886565
Verified for 6.1.0.ER8