Infinispan / ISPN-2550

NoSuchElementException in Hot Rod Encoder


    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • Affects Version/s: 5.2.0.Final
    • Fix Version/s: 5.2.0.Final
    • Component/s: Remote Protocols
    • Labels: None

      Tomas noticed this a while ago in a specific functional test:
      https://bugzilla.redhat.com/show_bug.cgi?id=875151

      I'm creating a more general JIRA, because I'm hitting this in a resilience test.

      What I found by a quick debug is that here:

      https://github.com/infinispan/infinispan/blob/master/server/hotrod/src/main/scala/org/infinispan/server/hotrod/Encoders.scala#L106

                     for (segmentIdx <- 0 until numSegments) {
                        val denormalizedSegmentHashIds = allDenormalizedHashIds(segmentIdx)
                        val segmentOwners = ch.locateOwnersForSegment(segmentIdx)
                        for (ownerIdx <- 0 until segmentOwners.length) {
                           val address = segmentOwners(ownerIdx % segmentOwners.size)
                           val serverAddress = members(address) // throws NoSuchElementException if 'address' is no longer in the members cache
                           val hashId = denormalizedSegmentHashIds(ownerIdx)
                           log.tracef("Writing hash id %d for %s:%s", hashId, serverAddress.host, serverAddress.port)
                           writeString(serverAddress.host, buf)
                           writeUnsignedShort(serverAddress.port, buf)
                           buf.writeInt(hashId)
                        }
                     }
      

      we're trying to obtain the serverAddress for a nonexistent address, and the resulting NoSuchElementException is not handled properly.
      It happens after I kill a node in the resilience test: the exception is thrown when the killed node's address is looked up in the members cache.
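
      For illustration only (this is not the actual fix): a minimal sketch of how the owner loop could tolerate addresses that have already left the members cache, assuming members is a collection.Map[Address, ServerAddress] as in the snippet above:

                     // Sketch only: skip owners whose address is no longer in the members cache
                     for (ownerIdx <- 0 until segmentOwners.length) {
                        val address = segmentOwners(ownerIdx)
                        members.get(address) match {
                           case Some(serverAddress) =>
                              val hashId = denormalizedSegmentHashIds(ownerIdx)
                              writeString(serverAddress.host, buf)
                              writeUnsignedShort(serverAddress.port, buf)
                              buf.writeInt(hashId)
                           case None =>
                              // e.g. a node that was just killed; skip it instead of throwing
                              log.tracef("Skipping owner %s, not in members cache", address)
                        }
                     }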

            [ISPN-2550] NoSuchElementException in Hot Rod Encoder

            Michal Linhard <mlinhard@redhat.com> made a comment on bug 886565

            Verified for 6.1.0.ER8


            Michal Linhard <mlinhard@redhat.com> changed the Status of bug 886565 from ON_QA to VERIFIED


            Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 886565 from MODIFIED to ON_QA


            Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 886565 from ASSIGNED to MODIFIED


            Dan Berindei (Inactive) added a comment:

            Sorry Michal, I didn't refresh the JIRA page before posting my comment.

            I'm glad the fix works; I'll try to get a unit test working as well before issuing a PR, though.

            Michal Linhard (Inactive) added a comment:

            I've reduced the number of entries in the cache during the test to 5000 1kB entries and I've got a clean resilience test run:

            http://www.qa.jboss.com/~mlinhard/hyperion3/run0013/report/stats-throughput.png
            Only expected exceptions:
            http://www.qa.jboss.com/~mlinhard/hyperion3/run0013/report/loganalysis/server/

            There is still a problem with uneven request balancing (ISPN-2632) and with the whole system blocking after a join when there's more data (5% of the heap filled), but that doesn't have to be related to the issues we're discussing here.

            Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 886565 from NEW to ASSIGNED


            Michal Linhard (Inactive) added a comment:

            As I said, ISPN-2642 didn't appear, so it seems to be fixed. I'm now investigating other problems I have with that test run.

            Dan Berindei (Inactive) added a comment:

            @Galder, could we modify the JIRA subject to say this one happens during a leave and the other happens during a join, then?

            @Michal, commit https://github.com/danberindei/infinispan/commit/754b9de995221075e14bba7fa459e597bdb16287 should fix ISPN-2642 as well, have you tested it?

            Michal Linhard (Inactive) added a comment:

            I've patched JDG 6.1.0.ER5 by replacing the infinispan-core and infinispan-server-hotrod jars with ones built from Dan's branch and ran resilience tests in hyperion:

            http://www.qa.jboss.com/~mlinhard/hyperion3/run0011/report/stats-throughput.png

            The issues ISPN-2550 and ISPN-2642 didn't appear, but the run still wasn't OK. After the rejoin of the killed node0002, all operations were blocked for more than 5 minutes, i.e. zero throughput in the last stage of the test. I'm investigating what happened there.

            Michal Linhard (Inactive) added a comment:

            The IndexOutOfBoundsException appears independently of Dan's fix, so I created ISPN-2642.

            Galder Zamarreño added a comment:

            Plus, if there really is an issue when a node joins in (as opposed to being killed), your fix won't work and would result in imbalances in the cluster... but let's not make judgements; let's see what ISPN-2624 is about and then we talk...

            Galder Zamarreño added a comment:

            @Dan, ISPN-2624 is a different scenario. It happens when a node starts up and one of the nodes is apparently set up for storage only (no Netty endpoint). To avoid confusion, I'm treating it as a different case right now, because it smells like a misconfiguration. Michal's case is about killing nodes.

            Dan Berindei (Inactive) added a comment (edited):

            What is the issue in ISPN-2624? The subject looks the same to me.

            Michal, yes, the commit https://github.com/danberindei/infinispan/commit/754b9de995221075e14bba7fa459e597bdb16287 was intended to fix the IndexOutOfBoundsException.

            Galder Zamarreño added a comment:

            Tomas' functional issue has now been separated into ISPN-2624, leaving this JIRA fully focused on the situation when the nodes are killed.

            Galder Zamarreño added a comment:

            Tomas, it seems like the config that you provided works fine as storage only.

            Can you create a separate issue to follow your issue? I don't wanna mix it with the node kill issue.

            Also, could you replicate the issue again and provide JDG version information, TRACE logs, etc.? Can you try to replicate the issue on JDG master too?

            Michal Linhard (Inactive) added a comment:

            The IndexOutOfBoundsException was found when running with https://github.com/danberindei/infinispan/commit/c3325b134704016fa556343529d6a3a5b9a96bcb

            BTW, I can now see another commit on the t_2550_m branch; would it still be helpful to test with it?

            Dan Berindei (Inactive) added a comment:

            Michal, what is the last commit you had when you ran the test?

            Tristan Tarrant <ttarrant@redhat.com> changed the Status of bug 875151 from NEW to ASSIGNED


            Michal Linhard (Inactive) added a comment:

            900MB of tasty tracelogs from runs with 5.2.0.Beta5 (resilience tests on hudson / perflab):

            http://www.qa.jboss.com/~mlinhard/test_results/serverlogs-trace-ispn2550.zip

            njoy!

            Michal Linhard (Inactive) added a comment:

            Dan, I wanted to try your change, but I don't see any further commit on the branch https://github.com/danberindei/infinispan/tree/t_2550_m

            Dan Berindei (Inactive) added a comment:

            The IndexOutOfBoundsException seems to appear because we're generating numOwners (2) "denormalized" hash ids for each segment, but the consistent hash has more owners than that for one segment (3). This can happen during a join, when the write CH is a union between the previous CH and the new, balanced CH.

            Tomas, I've updated my branch to use the read CH instead, could you try again?
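
            To illustrate the indexing mismatch described above (this is not Dan's actual fix, which switches to the read CH): the encoder builds numOwners denormalized hash ids per segment, so indexing that buffer with an owner index taken from a union CH that has more owners overflows. A purely defensive bound, reusing the names from the encoder snippet quoted in the description, could look like this sketch:

                     // Sketch only: never index past the denormalized hash ids generated for this segment
                     val denormalizedSegmentHashIds = allDenormalizedHashIds(segmentIdx)   // has numOwners entries
                     val segmentOwners = ch.locateOwnersForSegment(segmentIdx)             // may have more owners during a join (union CH)
                     val writableOwners = math.min(segmentOwners.length, denormalizedSegmentHashIds.length)
                     for (ownerIdx <- 0 until writableOwners) {
                        val address = segmentOwners(ownerIdx)
                        val hashId = denormalizedSegmentHashIds(ownerIdx)
                        // ... write the address and hashId as in the original loop ...
                     }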

            Tristan Tarrant <ttarrant@redhat.com> made a comment on bug 875151

            Yes, RCMs get the server list dynamically from the servers. However only the servers with an endpoint should add their address to the list.


            Martin Gencur <mgencur@redhat.com> made a comment on bug 875151

            Just a note about the test: when we create a RemoteCacheManager and pass just one address to it, it does not mean that all requests through cache.put/get will go just to this one address, but possibly to all nodes in the cluster. Is that right? AFAIK the HotRod client dynamically gets the information about all clustered nodes and autonomously chooses one of the cluster nodes to send requests to. If my assumption is correct, we would need to use a Memcached or REST client to properly test the storage-only example, not HotRod.


            Tomas Sykora <tsykora@redhat.com> made a comment on bug 875151


            Galder Zamarreño added a comment:

            Hey dberinde@redhat.com, can you check that IndexOutOfBoundsException issue? Let's see if Michal can upload TRACE.

            NadirX, Tomas' issue appears to show a storage-only node (which shouldn't have any endpoints, log ending in 49...) responding to a client request, so the endpoint is somehow active. Can you check the JDG configuration he's using to see if there are any issues there?

            Tomas Sykora <tsykora@redhat.com> made a comment on bug 875151


            Tomas Sykora <tsykora@redhat.com> made a comment on bug 875151

            https://svn.devel.redhat.com/repos/jboss-qa/jdg/jdg-functional-tests/trunk/remote/config-examples/standalone-storage-only/src/test/java/com/jboss/datagrid/test/examples/StorageOnlyConfigExampleTest.java

            Michal Linhard (Inactive) added a comment:

            I've run tests locally with Dan's fix and I'm seeing these exceptions:

            11:19:23,919 ERROR [org.infinispan.server.hotrod.HotRodDecoder] (HotRodClientMaster-5) ISPN005009: Unexpected error before any request parameters read
            java.lang.IndexOutOfBoundsException: 2
            	at scala.collection.mutable.ResizableArray$class.apply(ResizableArray.scala:44)
            	at scala.collection.mutable.ArrayBuffer.apply(ArrayBuffer.scala:47)
            	at org.infinispan.server.hotrod.AbstractTopologyAwareEncoder1x$$anonfun$writeHashTopologyHeader$1$$anonfun$apply$mcVI$sp$1.apply(AbstractTopologyAwareEncoder1x.scala:96)
            	at org.infinispan.server.hotrod.AbstractTopologyAwareEncoder1x$$anonfun$writeHashTopologyHeader$1$$anonfun$apply$mcVI$sp$1.apply(AbstractTopologyAwareEncoder1x.scala:92)
            	at scala.collection.immutable.Range.foreach(Range.scala:81)
            	at org.infinispan.server.hotrod.AbstractTopologyAwareEncoder1x$$anonfun$writeHashTopologyHeader$1.apply$mcVI$sp(AbstractTopologyAwareEncoder1x.scala:92)
            	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:78)
            	at org.infinispan.server.hotrod.AbstractTopologyAwareEncoder1x.writeHashTopologyHeader(AbstractTopologyAwareEncoder1x.scala:89)
            	at org.infinispan.server.hotrod.AbstractEncoder1x.writeHeader(AbstractEncoder1x.scala:62)
            	at org.infinispan.server.hotrod.HotRodEncoder.encode(HotRodEncoder.scala:63)
            	at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.doEncode(OneToOneEncoder.java:67)
            	at org.jboss.netty.handler.codec.oneone.OneToOneEncoder.handleDownstream(OneToOneEncoder.java:60)
            	at org.jboss.netty.channel.Channels.write(Channels.java:712)
            	at org.jboss.netty.channel.Channels.write(Channels.java:679)
            	at org.jboss.netty.channel.AbstractChannel.write(AbstractChannel.java:248)
            	at org.infinispan.server.core.AbstractProtocolDecoder.exceptionCaught(AbstractProtocolDecoder.scala:295)
            	at org.jboss.netty.channel.Channels.fireExceptionCaught(Channels.java:533)
            	at org.jboss.netty.channel.AbstractChannelSink.exceptionCaught(AbstractChannelSink.java:49)
            	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
            	at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
            	at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:84)
            	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.processSelectedKeys(AbstractNioWorker.java:472)
            	at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:333)
            	at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:35)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
            	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
            	at java.lang.Thread.run(Thread.java:662)
            


            Tomas Sykora <tsykora@redhat.com> made a comment on bug 875151

            I attached the surefire report from our test suite.
            Galder, please see the test: trunk/remote/config-examples/standalone-storage-only/src/test/java/com.jboss.datagrid.test.examples.StorageOnlyConfigExampleTest.java

            It is failing on line 73: rc1.put("k", "v");

            This put caused the attached stack trace.
            We are starting one JDG server with standalone-ha.xml and a second JDG with standalone-storage-only.xml, which you can find in jsgServer/docs/examples/configs.


            Michal Linhard (Inactive) added a comment:

            Tomas' tracelog shows exactly the same spot as my scenario: https://bugzilla.redhat.com/attachment.cgi?id=641649 (I'm not sure about his test scenario, though).

            Galder Zamarreño added a comment:

            Tomas, I was wondering which of the functional tests you had developed was failing, and where (stacktrace of the failure, etc.). The idea is to replicate that specific test in the Infinispan codebase. Thanks.

            Tomas Sykora <tsykora@redhat.com> made a comment on bug 875151

            Hi Galder,

            I experience this problem in our functional test suite for remote mode (server).

            preNOTE: you probably don't need to install the Arquillian project, as its CR1 is published already.
            preNOTE: you need to create an empty directory named "bundles" in edg0/, edg1/, etc.

            Please see this doc: https://docspace.corp.redhat.com/docs/DOC-87715
            Download our tests from svn and run this specific test (for the storage-only example).

            Just go to edgTest/trunk/remote and run

            mvn -s ~/programs/eclipseWorkspace/settings_mead_jdg_plus_local.xml clean verify -Dstack=udp -pl config-examples/standalone-storage-only -Dnode0.edghome=/home/tsykora/edg0 -Dnode1.edghome=/home/tsykora/edg1 -Dnode2.edghome=/home/tsykora/edg2 -Dmaven.test.failure.ignore=true

            NOTE: this user-specific mvn settings file (-s) points to my "local" repo, which comes with regular ER builds. You can ignore it and simply run this with these settings using the MEAD repo:

            https://svn.devel.redhat.com/repos/jboss-qa/jdg/scripts/settings_mead_jdg.xml

            You can obtain the latest JDG server from here: http://download.lab.bos.redhat.com/devel/jdg/stage/JDG-6.1.0-ER5/

            I hope I didn't forget anything. In case of any problem, anything, let me know.


            Michal Linhard (Inactive) added a comment:

            Right, that's true. I've just spoken with Tomas; he's gonna supply the way to test this in his scenario.
            I'll try to test Dan's fix as well.

            Galder Zamarreño added a comment:

            Michal, in the beginning you mentioned something about Tomas finding this in a functional test; that's the test I'm looking for.

            Also, if you can replicate the issue easily, can you try Dan's fix to see if it works?

            Michal Linhard (Inactive) added a comment:

            And one more important thing: during the whole test, a constant small load from multiple HotRod clients is applied. I think I had to have at least 10 locally for the bug to appear. It seems to happen when they're receiving the new topology and it fails for some of them...

            Michal Linhard (Inactive) added a comment:

            I found this using a resilience test that's implemented in the distributed SmartFrog framework we run in our perflab; I don't have it as a simple test method.

            What it does is this:
            1. start 4 nodes
            2. let them run 5 min
            3. kill node2
            4. wait for a cluster of node1, node3, node4
            5. wait 5 min
            6. start node2
            7. wait for a cluster of node1 - node4
            8. wait 5 min

            The exception happens in step 3, right after killing node2.
            I also managed to reproduce this locally running 4 nodes on my laptop - that's how I debugged it.

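            For reference, the scenario above could be sketched as a test skeleton. The helpers startNode, killNode, waitForClusterSize and runLoadFor below are hypothetical placeholders standing in for the SmartFrog-driven harness; this is a sketch of the procedure, not the actual test:

            // Sketch only: mirrors the 8-step resilience scenario described above
            object ResilienceScenarioSketch {
               def startNode(name: String): Unit = ()             // stub: start a JDG / Hot Rod server node
               def killNode(name: String): Unit = ()              // stub: kill the node's process
               def waitForClusterSize(expected: Int): Unit = ()   // stub: block until the cluster view has 'expected' members
               def runLoadFor(minutes: Int): Unit = ()            // stub: keep >= 10 Hot Rod clients doing puts/gets

               def main(args: Array[String]): Unit = {
                  Seq("node1", "node2", "node3", "node4").foreach(startNode)
                  waitForClusterSize(4)
                  runLoadFor(5)            // step 2: run 5 min under constant load
                  killNode("node2")        // step 3: the NoSuchElementException appears right after this
                  waitForClusterSize(3)    // step 4
                  runLoadFor(5)            // step 5
                  startNode("node2")       // step 6
                  waitForClusterSize(4)    // step 7
                  runLoadFor(5)            // step 8
               }
            }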

            Galder Zamarreño added a comment:

            Michal, can you share the test so that we can map it to an Infinispan unit test and verify Dan's fix?

            Galder Zamarreño added a comment:

            Dan, did you check the functional test Michal's referring to? You might be able to create a test out of that. I'm assigning this to you since you're more familiar with these changes.

            Dan Berindei (Inactive) added a comment:

            Galder, I think I have a fix for this issue: https://github.com/danberindei/infinispan/commit/3712ffac1ec1503f17b3f9de022bfc98a20b90e1

            The problem is that I don't have a test to go with it, so I'm not sure if it really works. So I'm not issuing a PR, but I'm leaving it here for reference.

            Michal Linhard <mlinhard@redhat.com> made a comment on bug 875151

            I'm seeing this in resilience tests for 6.1.0.ER4;
            I've created a more general JIRA for this.


              Assignee: Dan Berindei (Inactive) (dberinde@redhat.com)
              Reporter: Michal Linhard (Inactive) (mlinhard)
              Archiver: Amol Dongare (rhn-support-adongare)

                Created:
                Updated:
                Resolved:
                Archived: