Infinispan / ISPN-4743

Rebalance can hang after the coordinator and another node leave


      This caused a failure in ClusterTopologyManagerTest.testAbruptLeaveAfterGetStatus.

      When the coordinator changes, the new coordinator first sends a CacheTopologyControlCommand(type=CH_UPDATE) to reset any ongoing rebalance, then a CacheTopologyControlCommand(type=REBALANCE_START) to start a new rebalance with the remaining members. If another node leaves afterwards, the coordinator sends yet another CacheTopologyControlCommand(type=CH_UPDATE) to remove the leaver from the CHs.

      If one node (in this case, the coordinator itself) processes the last CH_UPDATE before the other two commands, it fails to confirm the rebalance, and the cache stays in the "rebalancing" state until another node joins or leaves.
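
      The ordering problem can be illustrated with a small toy model. This is only a sketch, not Infinispan code: the Command and Node classes, the numeric topology ids 1 to 3, and in particular the rule that a node drops any command whose topology id is not newer than the one it has already installed are assumptions made for illustration. Under that assumed rule, a node that sees the last CH_UPDATE first never acts on the REBALANCE_START and so never confirms.

```java
import java.util.Arrays;
import java.util.List;

public class RebalanceOrderingSketch {
    enum Type { CH_UPDATE, REBALANCE_START }

    // Hypothetical stand-in for CacheTopologyControlCommand: just a type and a topology id.
    static final class Command {
        final Type type;
        final int topologyId;

        Command(Type type, int topologyId) {
            this.type = type;
            this.topologyId = topologyId;
        }
    }

    // Toy node. ASSUMPTION: commands that are not newer than the installed topology are dropped.
    static final class Node {
        private int installedTopologyId = 0;
        private boolean confirmedRebalance = false;

        void handle(Command command) {
            if (command.topologyId <= installedTopologyId) {
                return; // treated as stale, silently ignored
            }
            installedTopologyId = command.topologyId;
            if (command.type == Type.REBALANCE_START) {
                confirmedRebalance = true; // in this toy model, confirmation is immediate
            }
        }
    }

    // The three commands described in the issue, with illustrative topology ids.
    static final Command RESET = new Command(Type.CH_UPDATE, 1);
    static final Command START = new Command(Type.REBALANCE_START, 2);
    static final Command REMOVE_LEAVER = new Command(Type.CH_UPDATE, 3);

    // Delivery order the coordinator intended.
    static List<Command> inOrder() {
        return Arrays.asList(RESET, START, REMOVE_LEAVER);
    }

    // Delivery order from the issue: the last CH_UPDATE is processed first.
    static List<Command> lastUpdateFirst() {
        return Arrays.asList(REMOVE_LEAVER, RESET, START);
    }

    // Replays the commands on a fresh node and reports whether it confirmed the rebalance.
    static boolean confirmsRebalance(List<Command> deliveryOrder) {
        Node node = new Node();
        for (Command command : deliveryOrder) {
            node.handle(command);
        }
        return node.confirmedRebalance;
    }

    public static void main(String[] args) {
        System.out.println("in order: confirmed=" + confirmsRebalance(inOrder()));
        System.out.println("last CH_UPDATE first: confirmed=" + confirmsRebalance(lastUpdateFirst()));
    }
}
```

      In this model the in-order delivery confirms the rebalance, while the reordered delivery does not, which matches the hang described above. The real confirmation path in Infinispan is more involved; the sketch only shows why processing order matters.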

            dberinde@redhat.com Dan Berindei (Inactive)