Infinispan / ISPN-3878

Unhandled failing ST cancel leads to deadlock

      Description

      Two concurrent rebalances can lead to a deadlock. One situation in which two rebalances run in parallel is when the coordinator leaves the cluster: it sends REBALANCE_START and then goes away, after which the new coordinator recovers the cluster status and sends REBALANCE_START as well.

      1. The node requests segments for the old topology; StateConsumerImpl.isTransferThreadRunning is set to true.
      2. The node waits for a StateResponseCommand in StateConsumerImpl (SCI): InboundTransferTask.awaitCompletion().
      3. A new rebalance starts and changes the consistent hash (CH); the requested segment is not in the new CH.
      4. Some state transfers (ST) are canceled; the cancel command is sent but takes a long time.
      5. The StateResponseCommand is received, but SCI.applyState finds that the segment is no longer owned, so the task is neither completed nor cancelled.
      6. Later, InboundTransferTask.sendCancelCommand throws a TimeoutException, and no further cancellations are executed.

      Result: the inbound transfer thread is stuck and the rebalance never completes.
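
      The sequence above can be reproduced in miniature. The following is only a sketch under stated assumptions, not Infinispan code: StuckTransferSketch, TransferTask, onStateResponse and the completeOnCancelFailure flag are hypothetical names, and a CountDownLatch stands in for whatever awaitCompletion() actually blocks on. It models a waiter that is released neither by the ignored state response (step 5) nor by the failed cancel (step 6), and shows one possible guard: marking the task complete even when the cancel command fails.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeoutException;

class StuckTransferSketch {

    /** Hypothetical stand-in for an inbound transfer task; not the real Infinispan class. */
    static final class TransferTask {
        private final CountDownLatch completion = new CountDownLatch(1);
        private final boolean completeOnCancelFailure;

        TransferTask(boolean completeOnCancelFailure) {
            this.completeOnCancelFailure = completeOnCancelFailure;
        }

        /** Models step 6: the remote cancel call always times out in this sketch. */
        private void sendCancelCommand() throws TimeoutException {
            throw new TimeoutException("cancel command timed out");
        }

        void cancel() throws TimeoutException {
            if (completeOnCancelFailure) {
                try {
                    sendCancelCommand();
                } finally {
                    completion.countDown(); // release the waiter even if the cancel fails
                }
            } else {
                sendCancelCommand();        // throws, so the next line is never reached
                completion.countDown();
            }
        }

        /** Models step 5: a response for a segment that is no longer owned is dropped. */
        void onStateResponse(boolean segmentStillOwned) {
            if (segmentStillOwned) {
                completion.countDown();
            }
        }

        /** Models step 2: the transfer thread blocks until the task completes. */
        void awaitCompletion() throws InterruptedException {
            completion.await();
        }
    }

    /** Returns true if the simulated transfer thread terminates within 500 ms. */
    static boolean transferThreadFinishes(boolean completeOnCancelFailure)
            throws InterruptedException {
        TransferTask task = new TransferTask(completeOnCancelFailure);
        Thread transferThread = new Thread(() -> {
            try {
                task.awaitCompletion();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "transfer-thread");
        transferThread.setDaemon(true);
        transferThread.start();

        task.onStateResponse(false);        // step 5: segment not in the new CH
        try {
            task.cancel();                  // steps 4 and 6: cancel is sent and times out
        } catch (TimeoutException expected) {
            // Nothing retries the cancellation after the timeout.
        }

        transferThread.join(500);
        boolean finished = !transferThread.isAlive();
        transferThread.interrupt();         // clean up the simulated stuck thread
        return finished;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("without guard, finishes: " + transferThreadFinishes(false));
        System.out.println("with guard, finishes: " + transferThreadFinishes(true));
    }
}
```

      In this sketch the unguarded variant leaves the transfer thread blocked, while the guarded variant lets it finish. Whether completing the task on a failed cancel is the right fix for the real code is a separate question; the sketch only illustrates why a swallowed timeout leaves the waiter with nobody to wake it.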

              People

              • Assignee: Dan Berindei (dan.berindei)
              • Reporter: Radim Vansa (rvansa)
