Infinispan / ISPN-6402

Default GMS.join_timeout is too long

      GMS.join_timeout is used by JGroups for two purposes:

      1. Wait for FIND_INITIAL_MBRS responses.
         - If other nodes are running but don't answer within join_timeout ms, the node starts a new partition by itself.
         - If no other nodes are running when the request is sent, but another node starts and sends its own discovery request within join_timeout ms, the initial cluster view will contain both nodes. This isn't really useful in Infinispan, which has gcb.transport().initialClusterSize() instead.
      2. Once a coordinator is located, the node sends a join request and waits up to join_timeout ms for a response. After a timeout, the node re-sends the join request (up to a maximum of max_join_attempts, which defaults to 10).

      The default GMS.join_timeout in Infinispan is 15000 ms, vs. 2000 ms in JGroups (actually 3000 ms in GMS itself, but 2000 ms in the example configurations).
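
To put numbers on the two uses of join_timeout, here is a rough sketch that compares the Infinispan and JGroups defaults. It is a simplified model, not JGroups code: the class and method names are hypothetical, it assumes a node that gets no discovery response waits the full join_timeout, and it assumes each join attempt can block for one full join_timeout.

```java
public class JoinTimeoutModel {

    // Assumed model: with no discovery responses, the node waits the whole
    // join_timeout before starting its own partition.
    public static long loneNodeStartupDelayMs(long joinTimeoutMs) {
        return joinTimeoutMs;
    }

    // Assumed upper bound on time spent waiting for join responses:
    // one join_timeout per attempt, for max_join_attempts attempts.
    public static long maxJoinWaitMs(long joinTimeoutMs, int maxJoinAttempts) {
        return joinTimeoutMs * maxJoinAttempts;
    }

    public static void main(String[] args) {
        int maxJoinAttempts = 10;            // default quoted in this issue
        long[] timeoutsMs = {15000L, 2000L}; // Infinispan default vs. JGroups example configs

        for (long timeoutMs : timeoutsMs) {
            System.out.printf(
                "join_timeout=%d ms: lone node waits %d ms, join retries wait at most %d ms%n",
                timeoutMs,
                loneNodeStartupDelayMs(timeoutMs),
                maxJoinWaitMs(timeoutMs, maxJoinAttempts));
        }
    }
}
```

Under these assumptions, the first node of a cluster pays the full join_timeout on every start, which is the cost the rest of this issue is concerned with.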

      The higher timeout only helps when a node is running but unreachable (e.g. because of a long GC pause) at the exact moment another node is joining. I'd argue that applications that can tolerate multi-second pauses would be better served by gcb.transport().initialClusterSize(2) and/or an external discovery mechanism (e.g. FILE_PING, or something based on the WildFly domain controller). For most applications, the current default just means a 15s delay every time the cluster is (re)started.
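
For reference, a minimal sketch of the initialClusterSize(2) alternative, assuming infinispan-core is on the classpath. The class name and the one-minute initialClusterTimeout are illustrative choices, not values proposed in this issue.

```java
import java.util.concurrent.TimeUnit;

import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class InitialClusterSizeExample {
    public static void main(String[] args) {
        GlobalConfigurationBuilder gcb = GlobalConfigurationBuilder.defaultClusteredBuilder();

        // Block cache manager startup until at least 2 members are in the view,
        // instead of relying on a long GMS.join_timeout to catch a paused node.
        gcb.transport()
           .initialClusterSize(2)
           .initialClusterTimeout(1, TimeUnit.MINUTES); // illustrative value

        DefaultCacheManager cacheManager = new DefaultCacheManager(gcb.build());
        try {
            System.out.println("Members: " + cacheManager.getMembers());
        } finally {
            cacheManager.stop();
        }
    }
}
```

With this setting the wait is explicit and bounded by initialClusterTimeout, so a shorter join_timeout does not stop an application from insisting on a minimum cluster size at startup.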

      In particular, because our integration tests use the default configuration, it means a delay of 15s for every test that starts a cluster.

              dberinde@redhat.com Dan Berindei (Inactive)
              Archiver:
              rhn-support-adongare Amol Dongare
