  1. Infinispan
  2. ISPN-2415

Initial state transfer timed out - Fail to start 2 nodes after they were killed inside 8-node cluster


    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • None
    • 5.2.0.Final
    • State Transfer
    • None

      We start 8 nodes and keep them under load, then kill 2 nodes and later start them again. However, when we try to start them, the following exception is thrown and the test fails:

      10:47:52,830 ERROR [org.radargun.stages.helpers.StartHelper] (pool-1-thread-1) Issues while instantiating/starting cache wrapper
      org.infinispan.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete() throws java.lang.InterruptedException on object of type StateTransferManagerImpl
      	at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:205)
      	at org.infinispan.factories.AbstractComponentRegistry$PrioritizedMethod.invoke(AbstractComponentRegistry.java:879)
      	at org.infinispan.factories.AbstractComponentRegistry.invokeStartMethods(AbstractComponentRegistry.java:650)
      	at org.infinispan.factories.AbstractComponentRegistry.internalStart(AbstractComponentRegistry.java:639)
      	at org.infinispan.factories.AbstractComponentRegistry.start(AbstractComponentRegistry.java:542)
      	at org.infinispan.factories.ComponentRegistry.start(ComponentRegistry.java:198)
      	at org.infinispan.CacheImpl.start(CacheImpl.java:517)
      	at org.infinispan.manager.DefaultCacheManager.wireAndStartCache(DefaultCacheManager.java:689)
      	at org.infinispan.manager.DefaultCacheManager.createCache(DefaultCacheManager.java:652)
      	at org.infinispan.manager.DefaultCacheManager.getCache(DefaultCacheManager.java:548)
      	at org.radargun.cachewrappers.InfinispanWrapper.setUpCache(InfinispanWrapper.java:125)
      	at org.radargun.cachewrappers.InfinispanWrapper.setUp(InfinispanWrapper.java:74)
      	at org.radargun.stages.helpers.StartHelper.start(StartHelper.java:63)
      	at org.radargun.stages.StartClusterStage.executeOnSlave(StartClusterStage.java:47)
      	at org.radargun.Slave$2.run(Slave.java:103)
      	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
      	at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
      	at java.util.concurrent.FutureTask.run(FutureTask.java:138)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
      	at java.lang.Thread.run(Thread.java:662)
      Caused by: org.infinispan.CacheException: Initial state transfer timed out for cache testCache on edg-perf02-25863
      	at org.infinispan.statetransfer.StateTransferManagerImpl.waitForInitialStateTransferToComplete(StateTransferManagerImpl.java:202)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      	at java.lang.reflect.Method.invoke(Method.java:597)
      	at org.infinispan.util.ReflectionUtil.invokeAccessibly(ReflectionUtil.java:203)
      	... 20 more
      

      The problem happens on nodes edg-perf02 and edg-perf03 in this Jenkins run: https://jenkins.mw.lab.eng.bos.redhat.com/hudson/job/ispn-52-radargun-resilience-8-6/29/

      Debug logs can be found on those machines.

      A few more hints:

      • individual exceptions/errors extracted from the log are available in the "Build artifacts"
      • this job has passed only once; every other run failed
      • the state transfer timeout is the default one (4 min?)
      • version of Infinispan: 5.2.0-SNAPSHOT, HEAD=d4581e570 - ISPN-2387 ClusteredGetCommand should not be a VisitableCommand
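
      Since the timeout above is left at its default, one way to check whether the joiners are merely slow rather than stuck would be to set it explicitly. The fragment below is only a sketch: it repeats the <clustering> block from the configuration in this report and adds a <stateTransfer> element with a timeout attribute, as I understand the 5.2 configuration schema. The 600000 ms value is an arbitrary illustration, not a recommendation, and nothing here is confirmed to change the outcome of the test.

```xml
<!-- Sketch only: the <clustering> block from this report's configuration,
     with an explicit state transfer timeout added. -->
<clustering mode="distribution">
   <l1 enabled="false" />
   <hash numOwners="3" numSegments="512" />
   <sync replTimeout="60000"/>
   <!-- Assumption: 600000 ms (10 min) is an illustrative value chosen
        to be clearly larger than the suspected 4 min default. -->
   <stateTransfer timeout="600000" />
</clustering>
```

      If the start still fails with the larger value, the timeout itself is unlikely to be the cause.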

      Infinispan configuration:

      <infinispan
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="urn:infinispan:config:5.2 http://www.infinispan.org/schemas/infinispan-config-5.2.xsd"
            xmlns="urn:infinispan:config:5.2">
      
         <global>
            <globalJmxStatistics
                  enabled="true"
                  jmxDomain="jboss.infinispan" 
                  cacheManagerName="default"/>
            <transport clusterName="default" distributedSyncTimeout="600000">
               <properties>
                  <property name="configurationFile" value="jgroups-udp-custom.xml" />
               </properties>
            </transport>
         </global>
      
         <default>
            <transaction
                transactionManagerLookupClass="org.infinispan.transaction.lookup.GenericTransactionManagerLookup"
                transactionMode="TRANSACTIONAL" />
            <jmxStatistics enabled="true"/>
      
            <clustering mode="distribution">
               <l1 enabled="false" />
               <hash numOwners="3" numSegments="512" />
               <sync replTimeout="60000"/>
            </clustering>
            <locking lockAcquisitionTimeout="3000" concurrencyLevel="1000" />
         </default>
         
         <namedCache name="testCache" />
         <namedCache name="memcachedCache" />
      
      </infinispan>
      

      Test scenario (description of RadarGun's job):

      <bench-config>
      
         <master bindAddress="${127.0.0.1:master.address}" port="${2103:master.port}" />
      
         <benchmark initSize="${8:slaves}" maxSize="${8:slaves}" increment="1">
            <DestroyWrapper runOnAllSlaves="true" />
            <StartCluster
               staggerSlaveStartup="true"
               delayAfterFirstSlaveStarts="5000"
               delayBetweenStartingSlaves="500" />
            <ClusterValidation
               partialReplication="false" />
            <StartBackgroundStats
               numThreads="10"
               numEntries="${1000:numEntries}"
               entrySize="1024"
               puts="1"
               gets="2"
               statsIterationDuration="${1000:statsIterationDuration}"
               delayBetweenRequests="100"
               transactionSize="${30:transactionSize}"
               startStressors="true" />
            <!-- Synchronously start stat threads -->
            <StartBackgroundStats
               startStats="true" />
            <Sleep
               time="120000" />
            <Kill
               slaves="1,2" />
            <Sleep
               time="120000" />
            <StartCluster
               slaves="1,2"
               staggerSlaveStartup="false" />
            <Sleep
               time="120000" />
            <StopBackgroundStats />
            <ReportBackgroundStats />
         </benchmark>
      
         <products>
            <infinispan52>
                <config name="distributed-udp-numowners-3.xml" cache="testCache"/>
            </infinispan52>
         </products>
      
         <reports />
      
      </bench-config>
      

      If any further information is needed, let me know.

              anistor Adrian Nistor (Inactive)
              mgencur Martin Gencur