Type: Task
Resolution: Done
Priority: Minor
GMS.join_timeout is used by JGroups for two purposes:
- Waiting for FIND_INITIAL_MBRS responses. If other nodes are running but don't answer within join_timeout ms, the node starts a new partition by itself. If no other nodes are running when the request is sent, but another node starts and sends its own discovery request within join_timeout, the initial cluster view will contain both nodes; this isn't really useful in Infinispan, which has gcb.transport().initialClusterSize() instead.
- Waiting for a join response once a coordinator is located. The node sends a join request and waits for a response for join_timeout ms. On timeout, it re-sends the join request, up to a maximum of max_join_attempts times (which defaults to 10).
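Both waits are bounded by attributes of the GMS protocol. As a rough sketch, the relevant fragment of a JGroups stack could look like this (the values here are illustrative, not taken from any shipped configuration):

```xml
<!-- Illustrative JGroups GMS fragment; values are examples only.
     join_timeout bounds both the discovery wait and each join-request wait;
     max_join_attempts caps how often the join request is re-sent. -->
<pbcast.GMS join_timeout="2000"
            max_join_attempts="10"/>
```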
The default GMS.join_timeout in Infinispan is 15000 ms, vs. 2000 in the JGroups example configurations (the default in GMS itself is 3000).
The higher timeout only helps when a node is running but inaccessible (e.g. because of a long GC pause) at the exact time another node is joining. I'd argue that applications that can tolerate multi-second pauses would be better served by gcb.transport().initialClusterSize(2) and/or an external discovery mechanism (e.g. FILE_PING, or something based on the WildFly domain controller). For most applications, the current default just means a 15s delay every time the cluster is (re)started.
In particular, because our integration tests use the default configuration, it means a delay of 15s for every test that starts a cluster.
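The initialClusterSize() alternative suggested above could be expressed declaratively along these lines (a sketch; the attribute names follow recent Infinispan configuration schemas, and the timeout value is an assumption, not a recommendation):

```xml
<!-- Sketch: require at least 2 nodes before the cache manager starts,
     instead of relying on a large GMS.join_timeout.
     Attribute names follow recent Infinispan schemas;
     the 30000 ms timeout is an illustrative value. -->
<cache-container>
  <transport initial-cluster-size="2"
             initial-cluster-timeout="30000"/>
</cache-container>
```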
Issue links:
- is related to: WFLY-1066 "Automatic configuration of 'Initial_hosts' for a cluster using JGroups TCP-stack in domain mode (aka DOMAIN_PING)" (Open)