-
Bug
-
Resolution: Done
-
Critical
-
15.0.0.Final
-
None
The issue might happen when nodes start concurrently. When one member is joining a cache and the JGroups view changes, the state transfer wrongly uses the new joiner as an existing member, causing data not to transfer correctly. This happens because the joiner will retry the request in the new view.
On start, the node sends the join request to the coordinator. The coordinator handles the request, starts a rebalance, and updates the topology to contain the pending consistent hash. However, if the JGroups view changes, the joiner resends the join request and receives back the topology with a pending consistent hash. This causes the state transfer to consider the new node is already part of the topology and is a state donor.
If the view changes before the node receives the join response, the issue happens consistently, regardless of whether it is configured to DIST or REPL cache. An example for each, where Node 0 and 1 start first, Node 2 sends the join request, and Node 3 starts concurrently, changing the view, we have:
In REPL:
Node 0: size 100/100 Node 1: size 100/100 Node 2: size 0/100 Node 3: size 66/100
In DIST:
Node 0: size 80/100 Node 1: size 82/100 Node 2: size 48/100 Node 3: size 69/100
Any subsequent node that joins will also have missing data from the transfer. To summarize, the issue has a tight window to occur:
- The coordinator already processed the join request;
- The view updates on the joiner before the join response is handled.
Also, keep in mind that the problem affects only the single cache the node is joining. Other caches will continue to work correctly.
The only workaround I can think of to mitigate this would be starting nodes in an ordered way. This is especially difficult for embedded users. It would require on the client application a mechanism to scale up by one, and only after the previous node is guaranteed to have joined the caches.
- is cloned by
-
JDG-6918 View change during a cache join can lead to not replicating data
- Closed