20:14 Question regarding Infinispan client/server communication using a remote Infinispan cache. When a lock timeout occurs (ISPN000136, ISPN000299), does the Infinispan server notify the client that its request timed out/deadlocked, or is it a quiet failure? We're having an issue where our client-side timeouts were modified and it looks like we're overloading Infinispan with retries. I'm trying to understand whether the server can notify the client that its transaction failed, or whether there's a "smart" way to tell the client not to retry a failed call, versus setting client timeouts higher or reducing/removing retries.
20:24 It only retries when the exception is IOException, TransportException, RemoteNodeSuspectException or RemoteIllegalLifecycleStateException. A lock timeout exception doesn't extend any of those, so it shouldn't be causing a retry - it would be propagated directly to the user code.
20:24 https://github.com/infinispan/infinispan/blob/master/client/hotrod-client/src/main/java/org/infinispan/client/hotrod/impl/operations/RetryOnFailureOperation.java#L159
The retry, if it is happening, we think is being triggered by a timeout in our client-side infinispan-spring client. So while it's not an internal Infinispan server-side retry, I'm trying to understand the behavior between client and server when these timeouts happen on the backend. We inadvertently ended up deploying `infinispan.remote.socketTimeout` and `connectTimeout` on our clients at 3s, while the Infinispan backend lock timeout is 10s. Since then we've seen constant lock contention and timeout issues. Per https://github.com/infinispan/infinispan-spring-boot/blob/71bdb3fba3a4bf0ae78bea6528a3f38c64564f8e/infinispan-spring-boot-starter-remote/src/test/java/test/org/infinispan/spring/starter/remote/ApplicationPropertiesTest.java - the Infinispan starter retries 10 times by default when its timeout threshold is exceeded.
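A minimal sketch of the client-side configuration being discussed, assuming the `infinispan.remote.*` property prefix used by the infinispan-spring-boot remote starter; the exact property spellings (especially `maxRetries`) and all values here are illustrative assumptions, not recommendations:

```properties
# --- Problem configuration (as deployed): the client gives up after 3s
# while the server's lock-acquisition timeout is 10s, so the client never
# sees the server's lock-timeout error and keeps timing out and retrying.
# infinispan.remote.socketTimeout=3000
# infinispan.remote.connectTimeout=3000

# --- Safer configuration per the advice below: client-side timeout greater
# than the server-side lock timeout, and fewer retries so a slow server is
# not hammered with duplicate requests.
infinispan.remote.socketTimeout=60000
infinispan.remote.connectTimeout=5000
infinispan.remote.maxRetries=3
```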
What we think we're seeing is a thundering herd event once we cross a threshold of timeouts/retries that the backend cannot keep up with; eventually the cluster goes down and stops responding (possibly because the clients are still retrying/connecting and sending requests).
20:58 In general, is there any guidance on client-side timeout settings appropriate for the server-side defaults, or general rules of thumb to follow, like keeping the client-side timeout greater than the server-side lock timeout, limiting retries, or at least having better retry logic?
09:02 The most important thing is to have the client-side timeout > the server-side lock timeout, so the client waits long enough to see the error from the server.
09:04 If infinispan.remote.socketTimeout is smaller than your lock timeout, the client is never going to see the lock timeouts, and it's going to keep retrying.
09:06 You can reduce the number of retries as well, and we should probably have a lower default, but anything greater than 1 means that when the server slows down you'll hit it with more requests and make it even slower.
17:52 So indeed, returning to the default of 60s seems to have taken care of the issue. However, even at 30s we were seeing the same behavior, albeit not nearly as bad as at 3s. Interestingly, at 30s it would only exhibit itself after 4-6 hours of running with production load, but much more quickly at 3s (like 2-3 minutes). I'm trying to understand the underlying behaviors that might lead to this condition and the eventual timeouts. We are unable to reproduce this behavior through load testing or in any lower environment. It has only occurred in production, at production volumes, and only after 4-6 hours of running that way without logging a single error. We're hesitant to enable too much logging, metrics, or stats in prod given the added overhead. We run a fairly high-volume site with about 3-5k logins per minute at peak volume.
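The retry-amplification point above (anything greater than 1 retry multiplies load on a slow server) can be sketched as an application-level wrapper with a retry cap and exponential backoff. This is not part of the Hot Rod client API (the built-in RetryOnFailureOperation retries without backoff up to maxRetries); the class and method names here are hypothetical:

```java
import java.util.concurrent.Callable;

/**
 * Hypothetical application-level retry wrapper: cap the number of retries
 * and back off exponentially between attempts, so a slow server is not hit
 * with an amplified request rate (the thundering-herd effect described above).
 */
class BoundedRetry {
    private final int maxRetries;
    private final long baseDelayMillis;

    BoundedRetry(int maxRetries, long baseDelayMillis) {
        this.maxRetries = maxRetries;
        this.baseDelayMillis = baseDelayMillis;
    }

    /** Delay before retry N (1-based): base * 2^(N-1). */
    long delayBeforeRetry(int retry) {
        return baseDelayMillis << (retry - 1);
    }

    /** Runs the operation, retrying up to maxRetries times with backoff. */
    <T> T call(Callable<T> op) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxRetries + 1; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt <= maxRetries) {
                    Thread.sleep(delayBeforeRetry(attempt));
                }
            }
        }
        throw last;
    }
}
```

Backing off trades latency for stability: a failing call takes longer to give up, but the server sees a decaying request rate instead of an immediate burst of duplicates.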
Trying to understand what would cause a lock request to take >3s, or even >30s. We are using the spring-boot Infinispan starter with the Infinispan/Hot Rod client v9.3.1 and server v9.3.1 in a distributed cache. Is there an accepted way to monitor lock contention, lock acquisition timing, etc.? Google isn't turning up much. The documentation on this topic also seems contradictory: the official documentation states different defaults than the actual spec doc and refers to settings that have been deprecated. clustered.xml
17:58 The docs need to be amended. @Don Naro ^
18:00 @Tom Hudak thanks for reporting that. I'll check it out and create an issue. Will update shortly.
18:01 Specifically the sections on locking isolation defaults (READ_COMMITTED vs. REPEATABLE_READ) and the deadlock-spin setting, which was deprecated. @Don Naro Thanks!
18:02 And unfortunately we do not keep track of lock timings. We do want to introduce some better stats/metrics in the near future, including breaking down individual ops by "area" (locking, persistence, networking) so that we can analyse things like this.
18:09 Thanks @Tristan. Any thoughts offhand on what I might be able to monitor to get an indication of something "going wrong" behind the scenes before it happens? We're not seeing any clear indicator, e.g. memory spikes, CPU spikes, excessive GCs, heap exhaustion, network socket exhaustion, or CPU contention. We do keep a close eye on threads, network, heap, garbage collection, and some JMX metrics, but we're a little blind to the inner workings outside of logging. Thanks, -Tom
19:05 I would definitely monitor thread usage
20:04 Looking at some possible configuration improvements. These are all distributed caches, however only one is configured for READ_COMMITTED with OPTIMISTIC locking, and according to the docs (as best I can infer) the defaults are REPEATABLE_READ and PESSIMISTIC (which enables non_xa by default?)
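One concrete way to act on "monitor thread usage" is the standard JMX `ThreadMXBean`, which is available in any JVM without extra dependencies. A minimal sketch; the class name and any alert threshold are illustrative, not an Infinispan recommendation:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

/**
 * Minimal in-JVM thread monitoring via the standard JMX ThreadMXBean.
 * A steadily climbing live/peak thread count can be an early sign of
 * requests piling up behind contended locks before errors are logged.
 */
class ThreadWatch {
    private static final ThreadMXBean THREADS = ManagementFactory.getThreadMXBean();

    /** Current live thread count in this JVM. */
    static int liveThreads() {
        return THREADS.getThreadCount();
    }

    /** True when the live thread count crosses an alerting threshold. */
    static boolean overThreshold(int threshold) {
        return liveThreads() > threshold;
    }

    public static void main(String[] args) {
        // Print a one-line snapshot suitable for periodic scraping.
        System.out.println("live=" + liveThreads()
                + " peak=" + THREADS.getPeakThreadCount()
                + " started=" + THREADS.getTotalStartedThreadCount());
    }
}
```

The same MBean is reachable remotely over a standard JMX connection, so an external monitor can poll it without adding code to the Infinispan JVMs.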
We run a 3-node cluster of pods in OpenShift (3.7) with the attached config. I'm considering the following changes as important tunables, but don't really know how to quantify their impact. Hoping someone could shed some light. Considering we average about 300-400 concurrent threads at any given time in our Infinispan JVMs:
- Change from the defaults to READ_COMMITTED and OPTIMISTIC locking for the remaining caches.
- Change concurrency-level from the default 32 to something higher than the total thread count, eg.
- Change lock-acquisition-timeout from 10000 to a smaller value. 1000? 500? In order to reduce waiting, fail locks faster, and safely reduce the client-side timeout settings to something lower than 1 min.
- Client side: change maxRetries to a lower value like 3.
Thoughts?
infinispan > Lock Timeout Question
14:08 <<< Lock timeout question >>>
> Looking at some possible configuration improvements. These are all distributed caches, however only one is configured for READ_COMMITTED with OPTIMISTIC locking, and according to the docs (as best I can infer) the defaults are REPEATABLE_READ and PESSIMISTIC (which enables non_xa by default?)
Not sure about the rest of the documentation, but this document is generated from the code so it should be up to date: https://docs.jboss.org/infinispan/9.3/configdocs/infinispan-config-9.3.html
The default is to not use transactions at all, in which case the isolation level and locking mode are both ignored. @Tristan I think we could log a warning if these attributes are modified and transactions are disabled, WDYT?
> Change from the defaults to READ_COMMITTED and OPTIMISTIC locking for the remaining caches.
Assuming you also want to enable transactions, I do not recommend this combination, because it breaks atomic operations: two transactions can modify the same set of keys in parallel and the last write wins.
> Change concurrency-level from the default 32 to something higher than the total thread count, eg.
concurrency-level shouldn't affect locking, as you don't have lock striping enabled.
> Change lock-acquisition-timeout from 10000 to a smaller value. 1000? 500? In order to reduce waiting, fail locks faster, and safely reduce the client-side timeout settings to something lower than 1 min.
1000ms sounds reasonable, but I wouldn't use 500ms unless you know your GC pauses are always below that.
> Client side: change maxRetries to a lower value like 3.
Sounds good.
14:19 > @Tristan I think we could log a warning if these attributes are modified and transactions are disabled, WDYT?
You mean add something in the validation of the builder: `if (attributes.isModified() && transactionMode() != TransactionMode.TRANSACTIONAL) log.warn`
16:27 FYI, I've created an issue to track the doc changes: https://issues.jboss.org/browse/ISPN-9560 I'll look to work on this next week or the following sprint.
16:50 Exactly @Tristan
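Pulling the server-side suggestions above together, a sketch of a distributed-cache fragment against the 9.3 configuration schema linked earlier; the cache name and all values are illustrative assumptions, not recommendations:

```xml
<!-- Illustrative fragment for the Infinispan 9.3 schema (see
     https://docs.jboss.org/infinispan/9.3/configdocs/infinispan-config-9.3.html).
     Values mirror the thread's suggestions, not official defaults. -->
<distributed-cache name="example-cache" mode="SYNC">
    <!-- 1000ms: fail lock acquisition fast, but stay above worst-case GC pauses,
         so client-side timeouts can safely be lowered well below 1 min -->
    <locking acquire-timeout="1000"
             concurrency-level="1000"
             striping="false"/>
    <!-- transactions disabled (the default): isolation and locking mode
         are ignored, so they are omitted here -->
    <transaction mode="NONE"/>
</distributed-cache>
```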