When a node performs a write and is the primary owner, it sends the client listener event. Sending the event can block while waiting to put it into the eventQueue. The thread executing this method can itself be the event loop for that listener, in which case it can block forever. I believe this came about after
I have found a few ways to fix this at the moment:
- Make the client listener async - this way it is fired on the notification thread
- Change the listener event to check if the current thread is the event loop and, if so, fire the events directly instead of submitting them to the Executor.
- Make the various write operations still use the worker thread pool - this was changed in the above-mentioned JIRA so I am guessing we don't want to do that
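To make the second option concrete, here is a minimal sketch of the "fire directly when already on the event loop" check. It uses plain java.util.concurrent rather than the real server classes. The names DirectFireSketch, EventLoop, inEventLoop, fire and deliver are all hypothetical and are not taken from the code base, and the bounded eventQueue is not modelled here. It only shows the shape of the thread check.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class DirectFireSketch {
    /** Hypothetical stand-in for a client listener's single-threaded event loop. */
    static final class EventLoop {
        volatile Thread loopThread;
        // Single thread executor; the factory records which thread is "the loop".
        final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            loopThread = new Thread(r, "event-loop");
            return loopThread;
        });
        final List<String> delivered = new CopyOnWriteArrayList<>();

        boolean inEventLoop() {
            return Thread.currentThread() == loopThread;
        }

        void fire(String event) {
            if (inEventLoop()) {
                // Already on the loop thread: deliver inline instead of
                // handing the event back to ourselves through the executor.
                deliver("direct", event);
            } else {
                // Any other thread: hand the event over to the loop.
                executor.execute(() -> deliver("queued", event));
            }
        }

        private void deliver(String how, String event) {
            delivered.add(how + ":" + event);
        }
    }

    static List<String> runDemo() {
        EventLoop loop = new EventLoop();
        // Fired from the calling thread, so it goes through the executor.
        loop.fire("from-caller");
        // Simulates a write that executes on the event loop itself.
        loop.executor.execute(() -> loop.fire("from-loop"));
        loop.executor.shutdown();
        try {
            if (!loop.executor.awaitTermination(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("event loop did not drain");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return List.copyOf(loop.delivered);
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

In this sketch the event fired from the loop thread never passes through the executor, so it cannot end up waiting on work that only the same thread could complete.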
The former of the fixes seems to be much more performant, but I don't know if I like that fix. The latter still has hiccups since it reduces availability of threads to respond to socket requests.
This is very easy to reproduce by running ClientEventStressTest on master as is, since it uses only 3 IO threads (the default is CPU * 2, which in some cases isn't that many either).