Red Hat Data Grid / JDG-6918

View change during a cache join can lead to not replicating data


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Critical
    • Fix Version: RHDG 8.4.8 GA
    • Affects Versions: RHDG 8.1.1 GA, RHDG 8.4.6 GA
    • Component: Core
    • Labels: None

    Description

      The issue can happen when nodes start concurrently. If the JGroups view changes while a member is joining a cache, state transfer wrongly treats the new joiner as an existing member, so data is not transferred to it correctly. This happens because the joiner retries the join request in the new view.
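
      In embedded mode, the timing can be exercised by starting members concurrently. Below is a minimal reproduction sketch, assuming Infinispan's embedded API and a REPL_SYNC cache named "data"; since the race needs a view change to land mid-join, it will not trigger on every run:

      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;

      import org.infinispan.Cache;
      import org.infinispan.configuration.cache.CacheMode;
      import org.infinispan.configuration.cache.Configuration;
      import org.infinispan.configuration.cache.ConfigurationBuilder;
      import org.infinispan.configuration.global.GlobalConfigurationBuilder;
      import org.infinispan.manager.DefaultCacheManager;

      public class ConcurrentJoinRepro {

         static Configuration replConfig() {
            return new ConfigurationBuilder()
                  .clustering().cacheMode(CacheMode.REPL_SYNC)
                  .build();
         }

         // Creating the cache manager joins the JGroups view; getCache()
         // joins the cache and triggers the state transfer.
         static Cache<Integer, Integer> startNode() {
            DefaultCacheManager cm = new DefaultCacheManager(
                  GlobalConfigurationBuilder.defaultClusteredBuilder().build());
            cm.defineConfiguration("data", replConfig());
            return cm.getCache("data");
         }

         public static void main(String[] args) {
            // Node 0 and Node 1 start first and hold the initial 100 entries.
            Cache<Integer, Integer> c0 = startNode();
            startNode();
            for (int i = 0; i < 100; i++) c0.put(i, i);

            // Node 2 and Node 3 start concurrently. If the view changes
            // (Node 3 joining) while Node 2's cache join is in flight,
            // Node 2 retries in the new view and can wrongly become a
            // state donor, ending up with missing entries.
            ExecutorService pool = Executors.newFixedThreadPool(2);
            pool.submit(ConcurrentJoinRepro::startNode);
            pool.submit(ConcurrentJoinRepro::startNode);
            pool.shutdown();
         }
      }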

       

      On start, the node sends a join request to the coordinator. The coordinator handles the request, starts a rebalance, and updates the topology to contain the pending consistent hash. However, if the JGroups view changes, the joiner resends the join request and receives back the topology with the pending consistent hash. This causes state transfer to consider the new node already part of the topology and to treat it as a state donor.
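
      To make the sequence concrete, here is a simplified sketch of the exchange. The types below (Topology, Coordinator, Joiner) are illustrative stand-ins, not Infinispan's actual topology classes:

      import java.util.Set;

      // Illustrative stand-ins, not Infinispan's real classes.
      record Topology(Set<String> members, Set<String> pendingMembers) {}

      interface Coordinator {
         // The first call adds the joiner to the pending consistent hash
         // and starts a rebalance; a retried call returns the updated
         // topology in which the joiner already appears as pending.
         Topology handleJoin(String joiner);
      }

      class Joiner {
         Topology join(Coordinator coordinator, String self, boolean viewChanged) {
            Topology topology = coordinator.handleJoin(self);
            if (viewChanged) {
               // The JGroups view changed before the first response was
               // handled, so the joiner retries and now receives a topology
               // whose pending consistent hash already contains `self`.
               topology = coordinator.handleJoin(self);
            }
            // Bug: because `self` is in the pending hash, state transfer
            // considers it an existing member (a donor) and sends it no data.
            return topology;
         }
      }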

       

      If the view changes before the node receives the join response, the issue happens consistently, regardless of whether the cache is configured as DIST or REPL. As an example of each mode, with Node 0 and Node 1 starting first, Node 2 sending the join request, and Node 3 starting concurrently and changing the view, we observe:

       In REPL:

      Node 0: size 100/100
      Node 1: size 100/100
      Node 2: size 0/100
      Node 3: size 66/100
      

       

      In DIST:

      Node 0: size 80/100
      Node 1: size 82/100
      Node 2: size 48/100
      Node 3: size 69/100
      
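
      For reference, per-node counts like the ones above can be collected with a local-only size check on each member. A small sketch, assuming an embedded Cache handle (the localSize helper name is ours):

      import org.infinispan.Cache;
      import org.infinispan.context.Flag;

      final class SizeCheck {
         // Counts only the entries stored on this node, skipping remote lookups.
         static int localSize(Cache<?, ?> cache) {
            return cache.getAdvancedCache()
                        .withFlags(Flag.CACHE_MODE_LOCAL)
                        .size();
         }
      }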

       

      Any subsequent node that joins will also be missing data from the transfer. To summarize, the issue has a tight window in which to occur:

       

      • The coordinator has already processed the join request;
      • The view updates on the joiner before the join response is handled.

      Also, keep in mind that the problem affects only the single cache the node is joining. Other caches will continue to work correctly.

       

      The only workaround I can think of to mitigate this would be starting nodes in an ordered way. This is especially difficult for embedded users: the client application would need a mechanism to scale up one node at a time, starting the next node only after the previous one is guaranteed to have joined the caches.
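
      For illustration, a sketch of that ordered scale-up, assuming Infinispan's embedded API on a clustered cache; the startAndAwait helper, the polling loop, and the 30-second timeout are ours:

      import java.util.concurrent.TimeUnit;

      import org.infinispan.Cache;
      import org.infinispan.manager.EmbeddedCacheManager;

      final class OrderedStartup {
         // Starts one node's cache and blocks until the cache topology
         // reaches the expected member count, so the caller starts the
         // next node only after this join has completed.
         static Cache<?, ?> startAndAwait(EmbeddedCacheManager cm, String cacheName,
                                          int expectedMembers) throws InterruptedException {
            Cache<?, ?> cache = cm.getCache(cacheName); // triggers the cache join
            long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(30);
            while (cache.getAdvancedCache().getDistributionManager()
                        .getCacheTopology().getMembers().size() < expectedMembers) {
               if (System.nanoTime() > deadline)
                  throw new IllegalStateException("Node did not join " + cacheName + " in time");
               Thread.sleep(100);
            }
            return cache;
         }
      }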

       

       

People

    Assignee: Jose Bolina (rh-ee-jbolina)
    Reporter: Wolf Fink (rhn-support-wfink)