In replicated HA scenarios I can see that replication breaks with the warning shown in [1].
This issue was already discussed in JBEAP-4742 (see the comments there). As a solution, the timeout was made configurable: you can set it using call-timeout in cluster-connection.
I have seen this issue in our CI, but I suspected it was an environment issue caused by slow NFS. However, I dug into it a bit more. Here are my findings.
Something seems to hang the synchronization process, because increasing call-timeout doesn't help.
I tracked the sending and receiving of synchronization packets in the trace logs. There is a 60s window in which no packet is sent or handled. The hanging packets are received only after [1] is printed to the log and replication is canceled.
When I set call-timeout to 2 minutes, replication fails because of a connection timeout error.
I can easily reproduce the issue in our CI, but I can't reproduce it locally on my laptop. Maybe there is a race condition that shows up only in slower environments.
I can see the same issue with 7.0.x.
Debugging tip: On both servers, a single thread takes care of sending/handling replication packets. You can track these threads in the trace logs; see the attachment.
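To spot a silent window like the 60s one described above, the per-thread tracking can be automated. The following is only a minimal sketch: it assumes every log line follows the same layout as the line in [1] (`HH:MM:SS,mmm LEVEL [category] (thread) message`), and the names `find_gaps`, `LINE_RE` and the default 30-second threshold are made up for illustration, not part of any server tooling.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout, taken from the log line in [1]:
#   10:43:58,180 WARN [category] (Thread-131) message
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<level>\w+)\s+"
    r"\[(?P<category>[^\]]+)\]\s+"
    r"\((?P<thread>.+?)\)\s+"
    r"(?P<message>.*)$"
)


def find_gaps(lines, thread, threshold=timedelta(seconds=30)):
    """Return (previous_time, current_time, delta) for every pause of at
    least `threshold` between consecutive log lines of `thread`."""
    gaps = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line)
        if match is None or match.group("thread") != thread:
            continue  # not a log line, or written by another thread
        current = datetime.strptime(match.group("time"), "%H:%M:%S,%f")
        if previous is not None:
            delta = current - previous
            if delta < timedelta(0):
                # The timestamps carry no date, so treat a negative
                # difference as a rollover past midnight.
                delta += timedelta(days=1)
            if delta >= threshold:
                gaps.append((previous.time(), current.time(), delta))
        previous = current
    return gaps
```

A possible use is to open the trace log of each server, pass the file object as `lines` together with the name of the replication thread on that server, and compare the reported windows with the time at which [1] is logged.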
[1]
10:43:58,180 WARN [org.apache.activemq.artemis.core.server] (Thread-131) AMQ222207: The backup server is not responding promptly introducing latency beyond the limit. Replication server being disconnected now.
Customer impact: Replication between Live and Backup may fail and is not restored automatically. An admin has to identify the situation and restart the server that acts as Backup.
- clones: JBEAP-7968 (7.1.0) The backup server is not responding promptly introducing latency beyond the limit. (Closed)
- incorporates: ENTMQBR-556 Sync won't catch up on replication under load (Closed)