When a node becomes coordinator, it sends a GET_STATUS command to recover the status of all the caches in the cluster. Then it sends a CH_UPDATE command for each cache, to install a new topology.
However, if the CH_UPDATE command fails for one cache, the coordinator will interrupt the recovery process, and it won't even install the cache topology for the rest of the caches locally. When a new node joins, it will act as if the cache was just started:
This is mitigated by the
ISPN-2966 fix (https://github.com/infinispan/infinispan/pull/1742). Since the CH_UPDATE command is now sent asynchronously, it ignores any exceptions on the targets. But on the 5.2.x branch, the CH_UPDATE command is sent synchronously, and it can fail with an exception like this:
If that happens, cluster recovery is interrupted and some caches may end up in a corrupted state where it's impossible for new members to join.