Normally, the JGroupsTransport startup sequence goes like this:
- Create the Channel
- Create the CommandAwareRpcDispatcher and install it as an UpHandler
- Connect the channel
This way, every RequestCorrelator message received by the channel is passed up to CommandAwareRpcDispatcher, which executes the appropriate command.
When using a JGroupsChannelLookup, the lookup implementation is allowed to return a Channel instance that is already connected (shouldConnect() == false). In that case there is a window, between the lookup connecting the channel and JGroupsTransport installing the CommandAwareRpcDispatcher, during which the channel has no UpHandler and messages sent to this node are discarded.
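The difference between the two startup orders can be illustrated with a toy model. This is only a sketch: ToyChannel and StartupOrderDemo are hypothetical stand-ins written for this illustration, not the JGroups or Infinispan classes, and the real channel behaviour is more involved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy stand-in for a channel (hypothetical, not the JGroups API). */
final class ToyChannel {
    private Consumer<String> upHandler; // null until a handler is installed
    private boolean connected;
    private int discarded;

    void setUpHandler(Consumer<String> handler) { this.upHandler = handler; }

    void connect() { this.connected = true; }

    /** Simulates a message arriving from another node. */
    void receive(String message) {
        if (!connected) {
            return; // not a cluster member yet, so nothing can arrive
        }
        if (upHandler == null) {
            discarded++; // nobody to pass the message up to
            return;
        }
        upHandler.accept(message);
    }

    int discardedCount() { return discarded; }
}

final class StartupOrderDemo {
    /** Normal order: the handler is installed before connect. */
    static int normalStartup(List<String> delivered) {
        ToyChannel channel = new ToyChannel();
        channel.setUpHandler(delivered::add);
        channel.connect();
        channel.receive("JOIN");
        return channel.discardedCount();
    }

    /** The lookup returned a connected channel: a message can arrive first. */
    static int preConnectedStartup(List<String> delivered) {
        ToyChannel channel = new ToyChannel();
        channel.connect();                    // done by the lookup
        channel.receive("JOIN");              // arrives inside the window
        channel.setUpHandler(delivered::add); // handler installed too late
        channel.receive("GET_STATUS");
        return channel.discardedCount();
    }

    public static void main(String[] args) {
        List<String> normal = new ArrayList<>();
        System.out.println("normal: discarded=" + normalStartup(normal)
                + " delivered=" + normal);
        List<String> preConnected = new ArrayList<>();
        System.out.println("pre-connected: discarded=" + preConnectedStartup(preConnected)
                + " delivered=" + preConnected);
    }
}
```

In the toy model the sender of the discarded JOIN never gets a response, which is the situation the timeouts and retries discussed in this issue are meant to cover.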
Normally a node only receives commands after it has sent a join request to the coordinator. There are, however, a few exceptions:
- On startup, LocalTopologyManagerImpl sends the join request to the JGroups coordinator, which may not have the UpHandler yet. This seems to be responsible for the recent hangs in ConcurrentStartTest. We have a workaround here: use a smaller timeout on the CacheTopologyControlCommand(JOIN) command, and retry it on TimeoutException.
- When a node becomes coordinator, ClusterTopologyManagerImpl broadcasts a GET_STATUS request to all cluster members, and expects a response from each of them. The same workaround with a smaller timeout and retries might work here.
- In replicated mode, write commands are broadcast to all cluster members, including members that have no UpHandler yet. There is some commented-out code in RpcManagerImpl.invokeRemotelyAsync() that might fix this by only waiting for responses from the cache topology members.
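The "smaller timeout plus retry" workaround mentioned for the JOIN and GET_STATUS cases could look roughly like the sketch below. It is not the actual LocalTopologyManagerImpl code: JoinRetry, TimedCall and retryOnTimeout are hypothetical names, and the sketch assumes that an attempt which times out has used up its whole per-attempt timeout.

```java
import java.util.concurrent.TimeoutException;

final class JoinRetry {
    /** Hypothetical stand-in for "send a command and wait up to timeoutMillis for the response". */
    interface TimedCall<T> {
        T call(long timeoutMillis) throws TimeoutException;
    }

    /**
     * Splits one long wait into several short attempts, so that a request
     * discarded by a node without an UpHandler is simply sent again.
     */
    static <T> T retryOnTimeout(TimedCall<T> call, long totalMillis, long attemptMillis)
            throws TimeoutException {
        if (attemptMillis <= 0) {
            throw new IllegalArgumentException("attemptMillis must be positive");
        }
        long remaining = totalMillis;
        int attempts = 0;
        while (remaining > 0) {
            long timeout = Math.min(attemptMillis, remaining);
            attempts++;
            try {
                return call.call(timeout);
            } catch (TimeoutException e) {
                // Assumption: a timed-out attempt consumed its full timeout.
                remaining -= timeout;
            }
        }
        throw new TimeoutException("No response after " + attempts + " attempts");
    }

    public static void main(String[] args) throws TimeoutException {
        int[] calls = {0};
        // Simulated coordinator that only answers from the third request on.
        String response = retryOnTimeout(timeout -> {
            if (calls[0]++ < 2) {
                throw new TimeoutException("request discarded");
            }
            return "joined";
        }, 1000, 100);
        System.out.println(response + " after " + calls[0] + " attempts");
    }
}
```

Keeping the overall budget equal to the original timeout means the caller still fails in about the same time if the coordinator never answers. Retrying is only safe if the command can be received more than once without harm, which would need to be checked for each command.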
We should consider deprecating JGroupsChannelLookup.shouldConnect() and requiring that the channel be connected only by JGroupsTransport, assuming that works with ForkChannel, of course.