- Bug
- Resolution: Done
- Critical
- 6.0.2.Final
- None
This appeared during the 32-node elasticity test in the Hyperion environment.
apex947 started a rebalance just as it was leaving the cluster, and apex948 dutifully cancelled that rebalance when it became the new coordinator. apex949 had already requested segments from apex959, so it sent a StateRequestCommand(CANCEL_STATE_TRANSFER) asynchronously to apex959. Then apex948 started a new rebalance, and apex949 asked apex959 for the same segments again. When apex959 finally received the cancel request, it didn't check the topology id, so it incorrectly cancelled the outbound transfer to apex949 that belonged to the new rebalance.
The solution would be to verify the topology id in the CANCEL_STATE_TRANSFER command before cancelling the transfer. I also think we can avoid sending the cancel command completely in this case, and only send it as we are about to stop.
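The topology id check proposed above could look roughly like the following sketch. It is only an illustration of the idea: `OutboundTransferRegistry` and its methods are hypothetical names, not Infinispan's actual state provider code, and it assumes a provider tracks at most one outbound transfer per requesting node, tagged with the topology id of the request that started it.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical provider-side bookkeeping for outbound state transfers.
 * Not Infinispan's real implementation; names and structure are assumptions.
 */
public class OutboundTransferRegistry {
    // Requesting node -> topology id of the request that started its transfer.
    private final Map<String, Integer> transfers = new HashMap<>();

    /** Registers (or replaces) the outbound transfer for a requestor. */
    public synchronized void startTransfer(String requestor, int topologyId) {
        transfers.put(requestor, topologyId);
    }

    /**
     * Cancels the transfer only if the cancel command carries the same
     * topology id as the active transfer. A late cancel from an older
     * topology is ignored.
     *
     * @return true if a transfer was actually cancelled
     */
    public synchronized boolean cancelTransfer(String requestor, int cancelTopologyId) {
        Integer activeTopologyId = transfers.get(requestor);
        if (activeTopologyId == null || activeTopologyId.intValue() != cancelTopologyId) {
            return false; // stale or unknown cancel request: leave state untouched
        }
        transfers.remove(requestor);
        return true;
    }

    /** Reports whether a transfer to the requestor is currently active. */
    public synchronized boolean hasTransfer(String requestor) {
        return transfers.containsKey(requestor);
    }
}
```

With this check, the sequence from the report (request in the old topology, new request in the new topology, then the delayed cancel for the old topology) leaves the new transfer to apex949 running. The topology id values used to exercise it are arbitrary.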
- relates to: ISPN-4571 JmxManagementIT.testRpcManagerAttributes random failures (Closed)