- Bug
- Resolution: Done
- Major
- 9.2.0.Final
- None
PreferAvailabilityStrategy checks the size of the stable topology, and only considers cache topologies that are derived from the biggest topology (in size) when picking a post-merge topology.
Unfortunately, this algorithm picks the wrong topology in some situations. If a node has a very long GC pause, it will report the old topology and the old stable topology when it comes back. If the rest of the cluster rebalanced in the meantime, the rest of the cluster now has both a smaller current topology and a smaller stable topology than the paused node reports, so the size check favors the stale topology from the paused node.
Furthermore, the stable topology is updated asynchronously, independently of the current topology. So even if there's a split and the minority partition installs a current topology with fewer members, it may take some time for its stable topology to be updated to match. In fact, it appears that when a rebalance is not needed (e.g. because the partition has a single node), the stable topology is never updated!
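To make the two failure modes concrete, here is a minimal sketch of a size-only selection rule. It is not the actual PreferAvailabilityStrategy code: `PartitionReport`, `pickByStableSize`, the node names A to D and both scenarios are hypothetical, and the second scenario assumes the majority's stable topology was already updated while the single-node minority's was not.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class StableTopologySizeSketch {

    /** Hypothetical stand-in for what one partition reports at merge time (not an Infinispan type). */
    static final class PartitionReport {
        final String name;
        final Set<String> currentMembers;
        final Set<String> stableMembers;

        PartitionReport(String name, Set<String> currentMembers, Set<String> stableMembers) {
            this.name = name;
            this.currentMembers = currentMembers;
            this.stableMembers = stableMembers;
        }
    }

    static Set<String> members(String... nodes) {
        return new LinkedHashSet<>(Arrays.asList(nodes));
    }

    /** Size-only rule: the report with the biggest stable topology wins (first one on ties). */
    static PartitionReport pickByStableSize(List<PartitionReport> reports) {
        PartitionReport best = null;
        for (PartitionReport r : reports) {
            if (best == null || r.stableMembers.size() > best.stableMembers.size()) {
                best = r;
            }
        }
        return best;
    }

    /** D pauses for a long GC; A, B and C rebalance without it; D comes back with old topologies. */
    static PartitionReport gcPauseScenario() {
        PartitionReport rest = new PartitionReport(
                "A,B,C", members("A", "B", "C"), members("A", "B", "C"));
        PartitionReport paused = new PartitionReport(
                "D (paused)", members("A", "B", "C", "D"), members("A", "B", "C", "D"));
        return pickByStableSize(Arrays.asList(rest, paused));
    }

    /** D is split off alone: its current topology shrinks, but its stable topology is never updated. */
    static PartitionReport staleStableScenario() {
        PartitionReport majority = new PartitionReport(
                "A,B,C", members("A", "B", "C"), members("A", "B", "C"));
        PartitionReport minority = new PartitionReport(
                "D (alone)", members("D"), members("A", "B", "C", "D"));
        return pickByStableSize(Arrays.asList(majority, minority));
    }

    public static void main(String[] args) {
        System.out.println("GC pause: preferred = " + gcPauseScenario().name);
        System.out.println("Stale stable topology: preferred = " + staleStableScenario().name);
    }
}
```

In both scenarios the size-only rule prefers the report from D, even though the other three nodes hold the more recent state.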