In our application, under high load, a Cache#remove(key) invocation returned the old value although the same key had already been removed on another node a few milliseconds earlier.
This occurred while another node was being shut down; that node happened to be both the coordinator and the primary owner of the key.
The cache is a distributed, DB-persisted, transactional (locking mode Pessimistic) cache with numOwners=1.
I was able to reproduce the issue with the attached unit test (tested on ISPN 7.1), which is set up as follows:
- MultipleCacheManagersTest with 3 nodes
- Cache is distributed, persisted, transactional (locking mode Pessimistic) with numOwners 1 or 2 (see below; getTestCacheNumOwners() in the test)
- a number of test keys get inserted; only keys with the coordinator as primary owner get used for the test
- while the keys get removed one-by-one and concurrently by 2 threads...
- ... the coordinator node gets stopped
At the end, the return values of the remove() invocations get checked.
remove() invocations that threw an error (SuspectException) are ignored.
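For reference, the check at the end of the test can be sketched roughly as below. This is not the attached test: it is a minimal stand-alone illustration that uses a plain ConcurrentHashMap as a stand-in for the cache, does not stop any node, and uses hypothetical names (ConcurrentRemoveSketch, countOldValueReturns). It only shows the semantics I would expect: for every key, exactly one of the two concurrent remove() invocations returns the old value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

public class ConcurrentRemoveSketch {

    // Two threads remove every key one by one; the result maps each key to the
    // number of remove() invocations that returned a non-null (old) value.
    static Map<String, Integer> countOldValueReturns(ConcurrentMap<String, String> cache,
                                                     List<String> keys) throws InterruptedException {
        Map<String, AtomicInteger> hits = new ConcurrentHashMap<>();
        for (String key : keys) {
            hits.put(key, new AtomicInteger());
        }
        Runnable remover = () -> {
            for (String key : keys) {
                try {
                    if (cache.remove(key) != null) {
                        hits.get(key).incrementAndGet();
                    }
                } catch (RuntimeException e) {
                    // Stand-in for SuspectException: failed invocations are ignored.
                }
            }
        };
        Thread t1 = new Thread(remover);
        Thread t2 = new Thread(remover);
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        Map<String, Integer> result = new TreeMap<>();
        hits.forEach((key, count) -> result.put(key, count.get()));
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        ConcurrentMap<String, String> cache = new ConcurrentHashMap<>();
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            keys.add("k" + i);
            cache.put("k" + i, "v");
        }
        // Expected semantics: every key maps to 1 and the cache is empty afterwards.
        System.out.println(countOldValueReturns(cache, keys));
        System.out.println("entries left: " + cache.size());
    }
}
```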
The following problems occurred on some test runs:
- the remove(key) invocations on both nodes returned null and the entry was still there in the end
- => I would have expected an exception here as remove(key) essentially failed
- the remove(key) invocations on both nodes returned null but the entry was actually removed
- => wrong return value of first invocation
- the remove(key) invocations on both nodes both returned the old value
- => wrong return value of second invocation
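To make the three cases above easier to tell apart, here is a small sketch of how the two return values and the final state of the entry could be classified. The class, enum and method names are hypothetical and not part of the attached test or of Infinispan. It assumes that both invocations completed without an exception and that the old value is non-null.

```java
import java.util.Objects;

public class RemoveOutcomeCheck {

    enum Outcome {
        OK,                        // exactly one invocation returned the old value, entry is gone
        NOT_REMOVED_NO_EXCEPTION,  // both returned null, entry still there
        REMOVED_BUT_NULL_RETURNED, // both returned null, entry is gone
        OLD_VALUE_RETURNED_TWICE,  // both returned the old value
        OTHER                      // any combination not described in this report
    }

    // oldValue:   value stored before the removes (assumed non-null)
    // ret1, ret2: return values of the two concurrent remove(key) invocations
    // finalValue: value found for the key at the end of the test (null if removed)
    static Outcome classify(String oldValue, String ret1, String ret2, String finalValue) {
        boolean firstGotOld = Objects.equals(oldValue, ret1);
        boolean secondGotOld = Objects.equals(oldValue, ret2);
        boolean bothNull = ret1 == null && ret2 == null;
        boolean stillThere = finalValue != null;

        if (bothNull) {
            return stillThere ? Outcome.NOT_REMOVED_NO_EXCEPTION
                              : Outcome.REMOVED_BUT_NULL_RETURNED;
        }
        if (firstGotOld && secondGotOld) {
            return Outcome.OLD_VALUE_RETURNED_TWICE;
        }
        boolean oneNull = ret1 == null || ret2 == null;
        if ((firstGotOld ^ secondGotOld) && oneNull && !stillThere) {
            return Outcome.OK;
        }
        return Outcome.OTHER;
    }

    public static void main(String[] args) {
        System.out.println(classify("v", "v", null, null)); // expected behaviour
        System.out.println(classify("v", null, null, "v")); // first case above
        System.out.println(classify("v", null, null, null)); // second case above
        System.out.println(classify("v", "v", "v", null));  // third case above
    }
}
```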
These issues do not occur on every test run, but quite often.
When testing with numOwners=2, I only got the last of the above issues (both invocations returned the old value).
Attached are log files from tests with numOwners=1
("org.infinispan.transaction" log level set to TRACE):
- ~40% of test runs produced: issue 3 ("2 remove() invocations returned the old value for key 'k4'")
- occurred rarely: issue 2 ("remove() did actually remove the key 'k4', but the return value was null")
- didn't occur with "org.infinispan.transaction" log level set to TRACE, but otherwise quite often: issue 1 ("remove() of key 'k4' did not succeed, but we got no exception (entry still there; value: 'v')")