I've noticed that runnable tasks submitted to the ExecutionService can be run more than once!
This happens when a member executing the task is suspected and removed from the cluster. The protocol handles this by resubmitting the request in Executing#handleView. I note the following code comment:
// The person currently servicing our request has gone down
// without completing so we have to keep our request alive by
// sending ours back to the coordinator
The problem is that there is no guarantee that the request was not completed.
In my application, a member executing the task had an absurdly long GC pause (30s), which meant it was temporarily removed from the cluster. When the pause ended, it happily continued executing the task, which by then had already been resubmitted and completed. The task in question modifies the database and is quite destructive, so you can imagine the fallout.
I was hoping to get opinions on whether this behaviour is correct, especially since we are implementing a standard java.util.concurrent interface. (I couldn't find anything in the ExecutorService JavaDoc to say it is wrong, though.)
Perhaps there could be control over the behaviour:
- At least once - assumes task failed and resubmits
- At most once - assumes task completed and cleans up - may not actually be complete
- Exactly once - not sure if this is possible
There is a simple change that gives the at most once behaviour: ignore consumers that have left instead of resubmitting their requests. I have the patch in my fork:
My project is already using it. The change could be wrapped in a simple configuration option on the protocol to control the behaviour, which would be a useful addition.
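To make the proposed option concrete, here is a minimal, self-contained sketch of how such a switch might behave. It is not the JGroups code or my patch: `ResubmitPolicyDemo`, `OnConsumerLeft`, `Tracker` and the member/task names are all hypothetical, and the real logic would live wherever the protocol reacts to a view change.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical sketch: none of these types exist in JGroups.
class ResubmitPolicyDemo {

    // What to do with an in-flight task when its consumer leaves the view.
    enum OnConsumerLeft {
        RESUBMIT, // at least once: assume the task failed and queue it again
        DROP      // at most once: assume the task completed and just clean up
    }

    // Tracks which member is currently running which task ids.
    static final class Tracker {
        private final OnConsumerLeft policy;
        private final Map<String, List<String>> inFlight = new HashMap<>();
        private final Queue<String> pending = new ArrayDeque<>();

        Tracker(OnConsumerLeft policy) {
            this.policy = policy;
        }

        void assign(String member, String taskId) {
            inFlight.computeIfAbsent(member, m -> new ArrayList<>()).add(taskId);
        }

        // Called when a member drops out of the view.
        // Returns the task ids that were put back on the pending queue.
        List<String> memberLeft(String member) {
            List<String> orphaned = inFlight.remove(member);
            if (orphaned == null || policy == OnConsumerLeft.DROP) {
                return new ArrayList<>(); // nothing is resubmitted
            }
            pending.addAll(orphaned);
            return orphaned;
        }

        int pendingCount() {
            return pending.size();
        }
    }

    public static void main(String[] args) {
        Tracker atLeastOnce = new Tracker(OnConsumerLeft.RESUBMIT);
        atLeastOnce.assign("memberB", "purge-old-rows");
        System.out.println(atLeastOnce.memberLeft("memberB")); // prints [purge-old-rows]

        Tracker atMostOnce = new Tracker(OnConsumerLeft.DROP);
        atMostOnce.assign("memberB", "purge-old-rows");
        System.out.println(atMostOnce.memberLeft("memberB")); // prints []
    }
}
```

With RESUBMIT, a member that was only paused (as in my GC case) can still finish its copy of the task after the resubmitted copy has run. With DROP, a task whose member really crashed is never run to completion. Neither setting gives exactly once on its own.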