  Red Hat Data Grid / JDG-2518

Cache startup failure with server hinting and insufficient segments

    Details

    • Type: Bug
    • Status: Verified
    • Priority: Major
    • Resolution: Done
    • Affects Version/s: JDG 7.2.3 GA, JDG 7.3 ER3
    • Fix Version/s: JDG 7.3.1 ER2
    • Component/s: Clustering
    • Labels:
      None
    • Target Release:
    • Steps to Reproduce:
      1. Set segments to 1 and add the machine setting to the transport.

      <distributed-cache name="default" segments="1" />
      ...
      <stack name="udp">
          <transport type="UDP" socket-binding="jgroups-udp" machine="${jboss.jgroups.transport.machine:machine1}" rack="${jboss.jgroups.transport.rack:rack1}" site="${jboss.jgroups.transport.site:site1}" />
      </stack>
      

      2. Start 3 nodes.
      3. The 3rd node fails with a "Replication timeout" error when state transfer times out.

      The log and clustered.xml are attached as logs.zip.

    • Workaround:
      Workaround Exists
    • Workaround Description:
      Increase the number of segments.
    • Affects:
      Release Notes
    • Sprint:
      JDG Sprint #25

      Description

      When a cache is configured with a small number of segments and server hinting is enabled, a node fails to start with the following error [1].
      This can be reproduced with RHDG 7.2.3 and 7.3 ER2.

      [1]

      ERROR [org.jboss.msc.service.fail] (MSC service thread 1-4) MSC000001: Failed to start service jboss.datagrid-infinispan.clustered.test: org.jboss.msc.service.StartException in service jboss.datagrid-infinispan.clustered.test: Failed to start service
      ...
      Caused by: org.infinispan.commons.CacheException: Unable to invoke method public void org.infinispan.statetransfer.StateTransferManagerImpl.start() throws java.lang.Exception on object of type StateTransferManagerImpl
      ...
      Caused by: org.infinispan.util.concurrent.TimeoutException: Replication timeout for svr01 (flags=0), site-id=site1, rack-id=rack1, machine-id=machine1)
      at org.infinispan.remoting.transport.jgroups.JGroupsTransport.checkRsp(JGroupsTransport.java:916)
      ...
      

      For example, in a 3-node cluster with the following settings, the 3rd node fails to start.
      With segments set to 20 (the 6.6.2 default), the 6th node fails to start with the same timeout.
      It seems that nodes cannot finish the initial state transfer, and startup fails, when the number of segments is too small relative to the number of nodes:

      <distributed-cache name="default" segments="1" />
      ...
      <stack name="udp">
          <transport type="UDP" socket-binding="jgroups-udp" machine="${jboss.jgroups.transport.machine:machine1}" rack="${jboss.jgroups.transport.rack:rack1}" site="${jboss.jgroups.transport.site:site1}" />
      </stack>
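
      The workaround (increasing the segment count) can be sketched as follows. This is only an illustration under assumptions: the value 256 is not taken from this issue, and the right count depends on the cluster size. The transport settings stay the same as above.

      ```xml
      <!-- Assumed value: 256 is illustrative only; use a segment count
           comfortably larger than the number of nodes in the cluster. -->
      <distributed-cache name="default" segments="256" />
      ```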
      

        Attachments

        1. logs.zip
          16 kB
        2. reproducer.zip
          150 kB

              People

              Assignee:
              dberinde@redhat.com Dan Berindei
              Reporter:
              rhn-support-hdaicho Hiroki Daicho (Inactive)
              Votes:
              0
              Watchers:
              4
