Details
Bug
Resolution: Done
Critical
6.0.1.Final
Description
Two concurrent rebalances can lead to a deadlock. Two rebalances can run in parallel when, for example, the coordinator is leaving the cluster: it sends REBALANCE_START and then goes away. The new coordinator then recovers the cluster status and sends REBALANCE_START as well.
1. A node requests segments for the old topology; StateConsumerImpl.isTransferThreadRunning is set to true
2. The node waits for the StateResponseCommand in StateConsumerImpl (SCI): InboundTransferTask.awaitCompletion()
3. A new rebalance is started, changing the consistent hash (CH); the requested segment is not in the new CH
4. Some state transfers are cancelled; the cancel command is sent and takes a long time
5. The StateResponseCommand is received, but SCI.applyState finds that the segment is no longer owned, so the task is neither completed nor cancelled
6. Later, InboundTransferTask.sendCancelCommand throws a TimeoutException, and no further cancellations are executed
Result: the inbound transfer thread is stuck and rebalance is never completed.
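The sequence above can be modelled with a minimal, self-contained sketch. It does not use Infinispan code: TransferTask and transferThreadUnblocks are hypothetical stand-ins for InboundTransferTask and the transfer thread, and the bounded wait exists only so that the demo terminates (the report describes a wait that never returns).

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

final class StuckTransferSketch {

    /** Hypothetical stand-in for an inbound transfer task; not Infinispan's real class. */
    static final class TransferTask {
        private final CountDownLatch done = new CountDownLatch(1);

        /** Called when the state was applied or the task was cancelled. */
        void complete() {
            done.countDown();
        }

        /** Bounded wait used only so the demo ends; returns false if still blocked. */
        boolean awaitCompletion(long millis) {
            try {
                return done.await(millis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    /**
     * Models steps 2 to 6 of the report.
     *
     * @param segmentStillOwned false if the new consistent hash dropped the segment (step 3)
     * @param cancelSucceeds    false if the cancel command times out (steps 4 and 6)
     * @return true if the waiting transfer thread is released
     */
    static boolean transferThreadUnblocks(boolean segmentStillOwned, boolean cancelSucceeds) {
        TransferTask task = new TransferTask();

        if (segmentStillOwned) {
            // Normal path: the response is applied and the task completes.
            task.complete();
        } else if (cancelSucceeds) {
            // Segment no longer owned, but the cancellation goes through and ends the task.
            task.complete();
        }
        // Otherwise the response is ignored (step 5) and the cancel timed out (step 6):
        // nothing ever completes the task.

        return task.awaitCompletion(200);
    }

    public static void main(String[] args) {
        System.out.println("owned segment:            " + transferThreadUnblocks(true, true));
        System.out.println("dropped, cancel succeeds: " + transferThreadUnblocks(false, true));
        System.out.println("dropped, cancel times out: " + transferThreadUnblocks(false, false));
    }
}
```

Only the last combination leaves the task incomplete, which corresponds to the stuck inbound transfer thread described in the result.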