  1. Infinispan
  2. ISPN-4743

Rebalance can hang after the coordinator and another node leave

Details

    Description

      This caused a failure in ClusterTopologyManagerTest.testAbruptLeaveAfterGetStatus.

      When the coordinator changes, the new coordinator first sends a CacheTopologyControlCommand(type=CH_UPDATE) to reset any ongoing rebalance, then a CacheTopologyControlCommand(type=REBALANCE_START) to start a new rebalance with the remaining members. If another node leaves after that, the coordinator sends a third command, another CacheTopologyControlCommand(type=CH_UPDATE), to remove the leaver from the consistent hashes (CHs).

      If a node (in this case the coordinator itself) processes the last CH_UPDATE before the other two commands, it fails to confirm the rebalance, and the cache stays in the "rebalancing" state until another node joins or leaves.
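
      The reordering can be illustrated with a toy model. This is only a sketch and not Infinispan code: the Command and Node classes, the numeric topology ids, and the rule that a node drops any command whose topology id is not newer than the last one it applied are all assumptions made for illustration. The sketch also treats handling REBALANCE_START as the point where the node confirms the rebalance.

```java
import java.util.Arrays;
import java.util.List;

public class RebalanceReorderSketch {
    enum Type { CH_UPDATE, REBALANCE_START }

    // Hypothetical, simplified stand-in for CacheTopologyControlCommand.
    static final class Command {
        final Type type;
        final int topologyId;

        Command(Type type, int topologyId) {
            this.type = type;
            this.topologyId = topologyId;
        }
    }

    // Toy node. Assumption: commands with a stale topology id are ignored.
    static final class Node {
        private int lastTopologyId = 0;
        private boolean rebalanceConfirmed = false;

        void handle(Command c) {
            if (c.topologyId <= lastTopologyId) {
                return; // stale command, dropped
            }
            lastTopologyId = c.topologyId;
            if (c.type == Type.REBALANCE_START) {
                // In this toy, seeing the start is enough to confirm.
                rebalanceConfirmed = true;
            }
        }
    }

    // The three commands in the order the new coordinator sends them.
    static List<Command> inOrderDelivery() {
        return Arrays.asList(
            new Command(Type.CH_UPDATE, 1),        // reset ongoing rebalance
            new Command(Type.REBALANCE_START, 2),  // start new rebalance
            new Command(Type.CH_UPDATE, 3));       // remove the second leaver
    }

    // Same commands, but the last CH_UPDATE is processed first.
    static List<Command> reorderedDelivery() {
        return Arrays.asList(
            new Command(Type.CH_UPDATE, 3),
            new Command(Type.CH_UPDATE, 1),
            new Command(Type.REBALANCE_START, 2));
    }

    // Feeds the commands to a fresh node and reports whether it confirmed.
    static boolean confirmsRebalance(List<Command> deliveryOrder) {
        Node node = new Node();
        for (Command c : deliveryOrder) {
            node.handle(c);
        }
        return node.rebalanceConfirmed;
    }

    public static void main(String[] args) {
        System.out.println("in order:  " + confirmsRebalance(inOrderDelivery()));
        System.out.println("reordered: " + confirmsRebalance(reorderedDelivery()));
    }
}
```

      Under these assumptions the in-order delivery confirms the rebalance, while the reordered delivery drops REBALANCE_START as stale and never confirms, which matches the hang described above.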

            People

              dberinde@redhat.com Dan Berindei