Local reads are fast enough, but in (transactional) replicated mode the additional interceptors and stage callbacks seem to hurt the async interceptor stack a lot more than the classic one.
One thing that's different with the new interceptors is that invokeNext() doesn't call command.acceptVisitor(nextInterceptor) directly. Instead it calls nextInterceptor.visitCommand(), and the interceptor decides whether to use double-dispatch (by extending DDAsyncInterceptor) or another strategy.
In theory this allows us to use simpler interceptors, e.g. ones with just the methods visitReadCommand(), visitWriteCommand(), and visitTxCommand(). CallInterceptor already calls command.perform() for each command. For now, however, most interceptors extend DDAsyncInterceptor, and tx replicated reads are slower than in 9.0.0.Alpha0.
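The two dispatch strategies described above can be sketched roughly as follows. This is a simplified model, not the real Infinispan API: the type names echo the text, but every signature here (including `isWrite()`, `DDInterceptor`, and `ReadWriteInterceptor`) is an assumption made for illustration.

```java
// Simplified model; names and signatures are hypothetical, not Infinispan's.
interface Visitor {
    Object visitGet(GetCommand c);
    Object visitPut(PutCommand c);
}

interface Command {
    Object acceptVisitor(Visitor v);
    boolean isWrite();
}

final class GetCommand implements Command {
    public Object acceptVisitor(Visitor v) { return v.visitGet(this); }
    public boolean isWrite() { return false; }
}

final class PutCommand implements Command {
    public Object acceptVisitor(Visitor v) { return v.visitPut(this); }
    public boolean isWrite() { return true; }
}

// The only method the stack calls; each interceptor picks its own strategy.
interface AsyncInterceptor {
    Object visitCommand(Command c);
}

// Strategy 1: double dispatch, in the spirit of extending DDAsyncInterceptor.
class DDInterceptor implements AsyncInterceptor, Visitor {
    public final Object visitCommand(Command c) { return c.acceptVisitor(this); }
    public Object visitGet(GetCommand c) { return "default"; }
    public Object visitPut(PutCommand c) { return "default"; }
}

// Strategy 2: a coarser interceptor that only tells reads from writes.
class ReadWriteInterceptor implements AsyncInterceptor {
    public Object visitCommand(Command c) {
        return c.isWrite() ? visitWriteCommand(c) : visitReadCommand(c);
    }
    Object visitReadCommand(Command c) { return "read"; }
    Object visitWriteCommand(Command c) { return "write"; }
}
```

In this model the caller only ever sees `visitCommand()`, so an interceptor that does not care about individual command types can skip the per-command visitor methods entirely.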
With transactions, the VisitableCommand.acceptVisitor() call site in DDAsyncInterceptor.visitCommand() is megamorphic (since the initial preload uses put, prepare, and commit). Adding a special check in invokeNext() to invoke command.acceptVisitor(nextInterceptor) didn't help, but adding a special check for GetKeyValueCommand made a big difference on my machine:
|Version|Throughput|
|9.0.0.Alpha0 (CommandInterceptor)|4937351.255 ±(99.9%) 61665.164 ops/s|
|9.0.0.Beta1 (AsyncInterceptor)|4387466.151 ±(99.9%) 78665.887 ops/s|
| |4247769.260 ±(99.9%) 133767.371 ops/s|
|master|4710798.986 ±(99.9%) 166062.177 ops/s|
|master with GKVC special case|5749357.895 ±(99.9%) 87338.878 ops/s|
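One way such a GetKeyValueCommand special case might look is sketched below. The actual patch is not shown here, so the placement of the check, the class `DDAsyncInterceptorSketch`, and all signatures are assumptions. The idea is that an exact type test followed by a direct call lets reads bypass the megamorphic acceptVisitor() call site, while every other command still goes through double dispatch.

```java
// Hypothetical, simplified stand-ins for the real command and visitor types.
interface Visitor {
    Object visitGetKeyValueCommand(GetKeyValueCommand c);
    Object visitPutKeyValueCommand(PutKeyValueCommand c);
}

abstract class VisitableCommand {
    abstract Object acceptVisitor(Visitor v);
}

final class GetKeyValueCommand extends VisitableCommand {
    Object acceptVisitor(Visitor v) { return v.visitGetKeyValueCommand(this); }
}

final class PutKeyValueCommand extends VisitableCommand {
    Object acceptVisitor(Visitor v) { return v.visitPutKeyValueCommand(this); }
}

class DDAsyncInterceptorSketch implements Visitor {
    Object visitCommand(VisitableCommand command) {
        // Fast path (assumed form of the special case): reads are dispatched
        // with a type check and a direct call instead of acceptVisitor().
        if (command instanceof GetKeyValueCommand) {
            return visitGetKeyValueCommand((GetKeyValueCommand) command);
        }
        // Slow path: regular double dispatch for all other commands.
        return command.acceptVisitor(this);
    }

    public Object visitGetKeyValueCommand(GetKeyValueCommand c) { return "get"; }
    public Object visitPutKeyValueCommand(PutKeyValueCommand c) { return "put"; }
}
```

Both paths end up in the same visitor method, so the check changes only how the call is dispatched, not what the interceptor does.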