- Type: Bug
- Resolution: Done
- Priority: Major
- Fix Version: 14.0.17.Final
- None
If a cache encounters a fatal exception on the server, the Server stops all other caches and then terminates with a FATAL exception. However, when the other caches are stopped we only call EmbeddedCacheManager#stop, which means that the caches' state is never persisted. Consequently, if another cache manages to form a cluster before the exception is thrown and it has PartitionHandling.DENY_READ_WRITES configured, then after the node restarts it is never possible for the cluster to become AVAILABLE again, as the UUID of the restarted node differs from the original.
The org.infinispan.LOCKS cache utilises PartitionHandling.DENY_READ_WRITES, therefore any code attempting to utilise a Lock will fail even if the server starts up correctly on restart.
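For context, a minimal sketch of a cache configured with the same DENY_READ_WRITES strategy as the internal org.infinispan.LOCKS cache; the cache name "locks-like" and the standalone embedded manager setup are illustrative assumptions, not taken from this report:

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.manager.EmbeddedCacheManager;
import org.infinispan.partitionhandling.PartitionHandling;

public class DenyReadWritesSketch {
   public static void main(String[] args) {
      // Clustered cache manager with default transport settings (illustrative).
      EmbeddedCacheManager manager = new DefaultCacheManager(
            GlobalConfigurationBuilder.defaultClusteredBuilder().build());

      // A distributed cache that denies reads and writes while the partition is
      // DEGRADED, the same strategy used by the internal org.infinispan.LOCKS cache.
      ConfigurationBuilder builder = new ConfigurationBuilder();
      builder.clustering()
             .cacheMode(CacheMode.DIST_SYNC)
             .partitionHandling()
             .whenSplit(PartitionHandling.DENY_READ_WRITES);

      Cache<String, String> cache = manager.administration()
            .getOrCreateCache("locks-like", builder.build());

      // While DEGRADED, operations on keys whose owners are missing fail with
      // org.infinispan.partitionhandling.AvailabilityException, as in the
      // GetKeyValueCommand error quoted below.
      cache.put("k", "v");

      // EmbeddedCacheManager#stop does not persist the caches' state, which is
      // why the restarted node cannot rejoin the cluster as the same member.
      manager.stop();
   }
}
```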
This issue was encountered in the Operator testsuite because a single node was failing due to ISPN-15089 and k8s automatically restarts the server pod on failure. Once the cluster successfully forms, attempts to perform a Backup Restore fail with:
ISPN000136: Error executing command GetKeyValueCommand on Cache 'org.infinispan.LOCKS', writing keys [] org.infinispan.partitionhandling.AvailabilityException: ISPN000306: Key 'ClusteredLockKey{name=BackupManagerImpl-restore}' is not available. Not all owners are in this partition
As a cache needs to lose at least half of its members, or all owners of a segment, a cluster is only affected by this issue if it meets one of the following:
- Cluster only has 2 nodes
- Cluster has > 2 nodes, but a cache has num_owners=1
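To illustrate the last condition, a hedged sketch of a configuration where each segment has a single owner, so losing any one node loses all owners of some segments; the class and method names are illustrative assumptions:

```java
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.partitionhandling.PartitionHandling;

public class SingleOwnerSketch {
   // With num_owners=1 every node is the sole owner of some segments, so a single
   // node failure removes all owners of those segments and the cache degrades
   // under DENY_READ_WRITES even in clusters with more than 2 nodes.
   static Configuration singleOwnerDenyReadWrites() {
      ConfigurationBuilder builder = new ConfigurationBuilder();
      builder.clustering().cacheMode(CacheMode.DIST_SYNC);
      builder.clustering().hash().numOwners(1);
      builder.clustering().partitionHandling().whenSplit(PartitionHandling.DENY_READ_WRITES);
      return builder.build();
   }
}
```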
- causes: ISPN-16015 Consistent strategy throws NPE for node joining during partition (In Progress)
- is related to: ISPN-5290 Better automatic merge for caches with enabled partition handling (To Do)