Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 8.2.0.Beta1
Affects Version/s: 8.1.0.Final
Component/s: Core, Test Suite
Labels: None
Git Pull Request: https://github.com/infinispan/infinispan/pull/3962

When the last owner of a segment leaves the cache, the coordinator updates the consistent hash (CH) and assigns numOwners new owners to that segment, so that a segment always has at least 1 owner. If a rebalance is in progress, both the current and the pending CH may have lost all the owners of a segment, in which case the coordinator assigns new owners in both CHs (not necessarily the same ones).
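As a rough illustration of the replacement step, here is a minimal sketch that models a CH as a plain map from segment to owner list. The class `OwnerlessSegments`, the method `removeLeaver`, and the "least-loaded members first" selection rule are assumptions made for this example; they are not Infinispan's actual API or algorithm. The sample data is the current and pending CH from the example in this report, at the point where C leaves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class OwnerlessSegments {
    // Hypothetical helper: drops `leaver` from every segment, then gives each
    // ownerless segment up to numOwners owners, least-loaded members first.
    // `members` is the membership after the leave (it must not contain `leaver`).
    static Map<Integer, List<String>> removeLeaver(Map<Integer, List<String>> ch, String leaver,
                                                   List<String> members, int numOwners) {
        Map<Integer, List<String>> result = new TreeMap<>();
        Map<String, Integer> load = new HashMap<>();
        for (String m : members) load.put(m, 0);
        for (Map.Entry<Integer, List<String>> e : ch.entrySet()) {
            List<String> owners = new ArrayList<>(e.getValue());
            owners.remove(leaver);
            for (String o : owners) load.merge(o, 1, Integer::sum);
            result.put(e.getKey(), owners);
        }
        for (List<String> owners : result.values()) {
            if (!owners.isEmpty()) continue; // the segment still has an owner
            List<String> candidates = new ArrayList<>(members);
            // Stable sort: ties keep the membership order.
            candidates.sort(Comparator.comparingInt((String m) -> load.get(m)));
            for (String c : candidates.subList(0, Math.min(numOwners, candidates.size()))) {
                owners.add(c);
                load.merge(c, 1, Integer::sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> members = Arrays.asList("A", "B", "E");
        Map<Integer, List<String>> current = new TreeMap<>();
        current.put(0, Arrays.asList("C"));
        current.put(1, Arrays.asList("B", "C"));
        Map<Integer, List<String>> pending = new TreeMap<>();
        pending.put(0, Arrays.asList("B", "C"));
        pending.put(1, Arrays.asList("B", "C"));
        System.out.println(removeLeaver(current, "C", members, 2)); // {0=[A, E], 1=[B]}
        System.out.println(removeLeaver(pending, "C", members, 2)); // {0=[B], 1=[B]}
    }
}
```

Because each CH is patched independently, segment 0 ends up with owners AE in the current CH but only B in the pending CH, which is the mismatch the example in this report relies on.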

Sometimes, this causes tests that create clusters with many nodes to spend a lot of time shutting down the cluster. Here's an example:

Cluster ABCDE, coordinator A, topology id = 0, currentCH = {0: CD, 1: BC}, pendingCH = null
D leaves
A broadcasts a REBALANCE_START command with topology id 1, members = ABCE, currentCH = {0: C, 1: BC}, pendingCH = {0: BC, 1: BC}
A and E confirm that they finished the rebalance
C leaves before sending the data for segment 0 to B
A broadcasts a CH_UPDATE command with topology id 2, members = ABE, currentCH = {0: AE, 1: B}, pendingCH = {0: B, 1: B}
A now owns segment 0 in the writeCH (which is the union of currentCH and pendingCH).
A tries to request segment 0 from the other owner in the currentCH, E
B confirms that it finished the rebalance
A broadcasts a new topology: topology id 3, currentCH = {0: B, 1: B}, pendingCH = null
E installs topology 3, and throws an IllegalArgumentException when handling A's request for segments
A cannot install topology 3, because it requests the transaction data while holding the lock on the LocalCacheStatus
A receives the IllegalArgumentException from E and retries. But because it still has the old topology, it keeps retrying on E ad infinitum, using a lot of CPU in the process.
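To make the ownership change between topologies 1 and 2 concrete, the sketch below computes the write CH as the union of the current and pending CH and lists the segments a node newly owns. `WriteChSketch` and its methods are hypothetical names for this illustration, not Infinispan classes; the CH contents are copied from the steps above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class WriteChSketch {
    // Owners per segment in the write CH: current owners first, then pending owners.
    static Map<Integer, Set<String>> writeCH(Map<Integer, List<String>> current,
                                             Map<Integer, List<String>> pending) {
        Map<Integer, Set<String>> union = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> e : current.entrySet()) {
            Set<String> owners = new LinkedHashSet<>(e.getValue());
            if (pending != null) {
                owners.addAll(pending.getOrDefault(e.getKey(), Collections.<String>emptyList()));
            }
            union.put(e.getKey(), owners);
        }
        return union;
    }

    // Segments that `node` owns in newWrite but did not own in oldWrite.
    static Set<Integer> addedSegments(String node, Map<Integer, Set<String>> oldWrite,
                                      Map<Integer, Set<String>> newWrite) {
        Set<Integer> added = new TreeSet<>();
        for (Map.Entry<Integer, Set<String>> e : newWrite.entrySet()) {
            Set<String> before = oldWrite.getOrDefault(e.getKey(), Collections.<String>emptySet());
            if (e.getValue().contains(node) && !before.contains(node)) added.add(e.getKey());
        }
        return added;
    }

    // Builds a two-segment CH from strings such as "BC" (one letter per owner).
    static Map<Integer, List<String>> ch(String seg0, String seg1) {
        Map<Integer, List<String>> m = new TreeMap<>();
        m.put(0, Arrays.asList(seg0.split("")));
        m.put(1, Arrays.asList(seg1.split("")));
        return m;
    }

    public static void main(String[] args) {
        Map<Integer, Set<String>> write1 = writeCH(ch("C", "BC"), ch("BC", "BC")); // topology 1
        Map<Integer, Set<String>> write2 = writeCH(ch("AE", "B"), ch("B", "B"));   // topology 2
        System.out.println(write2);                             // {0=[A, E, B], 1=[B]}
        System.out.println(addedSegments("A", write1, write2)); // [0]
    }
}
```

In this model A gains segment 0 in topology 2, and the only other owner of that segment in the current CH is E, which is why A sends its request to E.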

A requesting segment 0 from E is not a problem in itself - normally E would just send back an empty set of transactions and entries. The problem is that the cluster is able to install a new topology, because A already confirmed receiving all the data, but A is stuck with the old topology.
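The retry behaviour can be modelled with a small, self-contained sketch. Everything here is hypothetical (`StaleTopologyRetry`, `Provider`, `request`, the attempt limit), and it is not the change made in the linked pull request. It also assumes that E rejects the request because it no longer owns the segment in its installed topology, which the report does not state explicitly. The point is only that a requester whose local topology id never advances gets the same rejection on every attempt, so without a bound the loop would never end.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.IntSupplier;

public class StaleTopologyRetry {
    enum Outcome { SUCCESS, TOPOLOGY_CHANGED, GAVE_UP }

    // Hypothetical stand-in for node E: it only serves segments it owns
    // in the topology it has installed.
    static class Provider {
        private final Set<Integer> ownedSegments;

        Provider(Integer... owned) {
            this.ownedSegments = new HashSet<>(Arrays.asList(owned));
        }

        void handleRequest(int segment) {
            if (!ownedSegments.contains(segment)) {
                throw new IllegalArgumentException("Not an owner of segment " + segment);
            }
        }
    }

    // Retries a rejected request, but stops when the requester's own topology id
    // changes or after maxAttempts (a bound added for this sketch).
    static Outcome request(Provider target, int segment, IntSupplier localTopologyId, int maxAttempts) {
        int startTopologyId = localTopologyId.getAsInt();
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                target.handleRequest(segment);
                return Outcome.SUCCESS;
            } catch (IllegalArgumentException rejected) {
                if (localTopologyId.getAsInt() != startTopologyId) {
                    return Outcome.TOPOLOGY_CHANGED; // caller should recompute the sources
                }
            }
        }
        return Outcome.GAVE_UP;
    }

    public static void main(String[] args) {
        Provider nodeE = new Provider(); // E at topology 3 owns no segments
        // A is stuck at topology 2, so every attempt is rejected.
        System.out.println(request(nodeE, 0, () -> 2, 5)); // GAVE_UP
    }
}
```

In the scenario above A cannot observe topology 3 at all, because installing it needs the lock that the requesting thread is holding, so the `TOPOLOGY_CHANGED` exit would never fire and only the attempt limit ends the loop in this model.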

Assignee: Dan Berindei (Inactive)
Reporter: Dan Berindei (Inactive)
Archiver: Amol Dongare
Created: 2016/01/26 11:27 AM
Updated: 2020/02/07 5:59 AM
Resolved: 2016/01/29 11:34 AM
Archived: 2024/11/28 6:21 AM
