Infinispan / ISPN-4743

Rebalance can hang after the coordinator and another node leave

Details

    Description

      This caused a failure in ClusterTopologyManagerTest.testAbruptLeaveAfterGetStatus.

      When the coordinator changes, the new coordinator first sends a CacheTopologyControlCommand(type=CH_UPDATE) to reset any ongoing rebalance, then a CacheTopologyControlCommand(type=REBALANCE_START) to start a new rebalance with the remaining members. If another node leaves afterwards, the coordinator sends yet another CacheTopologyControlCommand(type=CH_UPDATE) to remove the leaver from the consistent hashes (CHs).

      If one node (in this case the coordinator itself) processes the last CH_UPDATE before the other two commands, it will fail to confirm the rebalance, and the cache will stay in "rebalancing" state until another node joins or leaves.
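
The ordering problem described above can be illustrated with a toy model. This is only a sketch, not Infinispan's code: the class RebalanceOrderingSketch, the Command type, and the topologyId field are hypothetical names introduced here. It also assumes that each command carries an increasing topology id and that a node drops any command older than the newest one it has applied; the report does not confirm either detail.

```java
import java.util.List;

/** Toy model of one node applying topology commands. Hypothetical names, not Infinispan's API. */
public class RebalanceOrderingSketch {
    enum Type { CH_UPDATE, REBALANCE_START }

    /** A command as sent by the coordinator. The topologyId field is an assumption of this sketch. */
    static final class Command {
        final Type type;
        final int topologyId;

        Command(Type type, int topologyId) {
            this.type = type;
            this.topologyId = topologyId;
        }
    }

    /** The three commands from the description, in the order the coordinator sends them. */
    static final List<Command> SENT = List.of(
            new Command(Type.CH_UPDATE, 1),        // reset any ongoing rebalance
            new Command(Type.REBALANCE_START, 2),  // start a rebalance with the remaining members
            new Command(Type.CH_UPDATE, 3));       // remove the later leaver

    /**
     * Applies SENT in the given arrival order (indices into SENT) and reports
     * whether the node ever gets to confirm the rebalance.
     */
    public static boolean confirmsRebalance(int... arrivalOrder) {
        int latestApplied = 0;
        boolean confirmed = false;
        for (int index : arrivalOrder) {
            Command c = SENT.get(index);
            if (c.topologyId <= latestApplied) {
                continue; // assumed rule: commands older than the applied topology are dropped
            }
            latestApplied = c.topologyId;
            if (c.type == Type.REBALANCE_START) {
                confirmed = true; // the node joins the rebalance and can confirm it
            }
        }
        return confirmed;
    }

    public static void main(String[] args) {
        System.out.println("sent order:           " + confirmsRebalance(0, 1, 2));
        System.out.println("last CH_UPDATE first: " + confirmsRebalance(2, 0, 1));
    }
}
```

Under these assumptions, a node that applies the last CH_UPDATE first skips REBALANCE_START as stale, so it never sends a confirmation and the coordinator keeps waiting.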

          People

            dberinde@redhat.com Dan Berindei (Inactive)