Type: Enhancement
Resolution: Unresolved
Priority: Major
At the moment there is no detection of whether a node that joins the cluster is one of the nodes known from the "last stable view" or not.
This has the drawback that the cluster will stay in DEGRADED_MODE if some nodes are restarted during a split-brain.
Assume the cluster split is caused by a power failure of some nodes; the remaining nodes go DEGRADED because >= numOwners nodes are lost.
If the failed nodes are restarted (say the application uses library mode in EAP), these instances are now identified as new nodes because their node IDs are different.
When these nodes join the 'cluster', all nodes remain DEGRADED because the restarted instances are treated as different nodes rather than as the lost ones, so the cluster will not heal and come back to AVAILABLE.
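A minimal sketch of the missing check, assuming each node carries a stable identity that survives a restart (for example a persisted UUID); HealingCheck, lastStableMembers and currentMemberIds are illustrative names only, not existing Infinispan APIs:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: the cluster could be considered healable once every
// member of the "last stable view" has rejoined under its persisted identity.
final class HealingCheck {

    // true if every node known from the last stable view is present again
    static boolean allStableMembersReturned(Set<String> lastStableMembers,
                                            List<String> currentMemberIds) {
        return new HashSet<>(currentMemberIds).containsAll(lastStableMembers);
    }
}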
Some of these situations can be prevented by using server hinting to ensure that at least one owner survives (see the sketch below).
But there are other cases where it would be good to have a different strategy to bring the cluster back to AVAILABLE mode.
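For reference, server hinting is an existing Infinispan feature configured on the transport, so that not all copies of an entry end up on the same machine/rack/site; a minimal programmatic sketch, where the machine/rack/site values are examples:

import org.infinispan.configuration.global.GlobalConfigurationBuilder;

public class ServerHintingConfig {
    public static GlobalConfigurationBuilder build() {
        GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();
        // topology hints used by the distribution algorithm when placing owners
        global.transport()
              .machineId("machine-1")
              .rackId("rack-1")
              .siteId("site-a");
        return global;
    }
}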
During the split-brain itself there is no way to continue, as it is impossible to know whether "the other" partition is gone or still accessible but not visible.
With a shared persistence it might be possible, but synchronizing that via locking and version columns would be a huge drawback for the normal working state.
If the node ID can be kept (see the sketch after this list), I see the following enhancements:
- with a shared persistence there should be no data loss; once all nodes are back in the cluster it can go AVAILABLE and reload the missing entries
- for a 'side' cache the values are calculated or retrieved from other (slow) systems, so the cluster can go AVAILABLE and reload the entries
- in other cases there might be a WARNING/ERROR that all members are back from the split but some data may have been lost, and the cache is set back to AVAILABLE automatically or manually
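One existing building block for keeping a node's identity across a restart is global state persistence, which stores the node's persistent UUID on disk; a minimal sketch, with the path as an example. Whether this alone is enough to match restarted nodes against the last stable view is exactly what this enhancement asks for:

import org.infinispan.configuration.global.GlobalConfigurationBuilder;

public class PersistentIdentityConfig {
    public static GlobalConfigurationBuilder build() {
        GlobalConfigurationBuilder global = GlobalConfigurationBuilder.defaultClusteredBuilder();
        // persists global state (including the node's persistent UUID) so a
        // restarted node can come back with the same identity
        global.globalState()
              .enable()
              .persistentLocation("/var/lib/infinispan/state");
        return global;
    }
}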
It might be complicated to compute these modes, but a partition-handling configuration could give the administrator the possibility to decide which behaviour is appropriate for a cache,
e.g.
<partition-handling enabled="true" healing="HEALING.MODE"/>
where the modes are:
- AVAILABLE_NO_WARNING: back to AVAILABLE after all nodes from the "last stable" view are back
- AVAILABLE_WARNING_DATALOST: ditto, but log a warning that some data may have been lost
- WARNING_DATALOST: only log a warning and a hint how to switch back to AVAILABLE manually (see the example after this list)
- NONE: same as the current behaviour (if necessary at all; maybe WARNING_DATALOST is similar or better)
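For the manual path mentioned in WARNING_DATALOST, there is already an API to force a cache back to AVAILABLE (also exposed via JMX); a minimal sketch, with the configuration file and cache name as placeholders:

import org.infinispan.AdvancedCache;
import org.infinispan.manager.DefaultCacheManager;
import org.infinispan.partitionhandling.AvailabilityMode;

public class ForceAvailable {
    public static void main(String[] args) throws Exception {
        DefaultCacheManager cm = new DefaultCacheManager("infinispan.xml");
        try {
            AdvancedCache<Object, Object> cache =
                    cm.<Object, Object>getCache("myCache").getAdvancedCache();
            if (cache.getAvailability() == AvailabilityMode.DEGRADED_MODE) {
                // accepts possible data loss, as described above
                cache.setAvailability(AvailabilityMode.AVAILABLE);
            }
        } finally {
            cm.stop();
        }
    }
}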
relates to:
- ISPN-15191 Cache startup failures on individual nodes can cause other caches to enter DEGRADED mode on restart (Resolved)
- ISPN-7800 Cluster always in Degraded Mode (Closed)