-
Bug
-
Resolution: Done
-
Blocker
-
7.1.0.DR9, 7.1.0.DR11, 7.1.0.DR12, 7.1.0.DR14, 7.1.0.DR15, 7.1.0.DR16, 7.1.0.DR18, 7.1.0.DR19, 7.1.0.ER1
-
AMQ Sprint 3
In replicated HA scenarios I can see the replication is broken because of [1].
This issue was already discussed in JBEAP-4742, see comments. As a solution the timeout was made configurable. You can configure it using call-timeout in cluster-connection.
I have seen this issue in our CI but I have suspected it is an environment issue caused by slow NFS. However I dug into this a bit more. Here are my findings.
It seems that something hangs the synchronization process because increasing of call-timeout doesn't help.
I have tracked sending and receiving of synchronization packets in trace logs. There is 60s window in which no packet is handled or sent. Hanging packets are received after the [1] is printed to log and replication is canceled.
When I set call-timeout to 2 minutes, replication fails because of connection timeout error.
I can easily reproduce the issue in our CI, but I can't reproduce it locally on my laptop. Maybe there is some race condition which reveals only in slower environment.
I can see the same issue with 7.0.x.
Tip for debug: On both servers there is one thread which takes care about sending/handling replication packets. You can track these threads in trace logs, see attachment.
[1]
10:43:58,180 WARN [org.apache.activemq.artemis.core.server] (Thread-131) AMQ222207: The backup server is not responding promptly introducing latency beyond the limit. Replication server being disconnected now.
Customer impact: Replication between Live and Backup may fail and the process is not restored automatically. This can happen during initial synchronization between live->backup when backup is started for the first time or after failback. This can be hit when executing proof of concept by user/customer. Admin has to identify such situation and restart server which acts as Backup. Backup will not activate if Live server crashes which will lead to unavailability of service.
- is blocked by
-
JBEAP-10030 Upgrade Artemis 1.5.4.jbossorg-002
- Closed
-
JBEAP-10723 Upgrade Artemis 1.5.4.jbossorg-004
- Closed
-
JBEAP-12044 Upgrade Artemis 1.5.5.jbossorg-006
- Closed
- is cloned by
-
JBEAP-8630 (7.0.z) The backup server is not responding promptly introducing latency beyond the limit.
- Resolved
- is incorporated by
-
JBEAP-10030 Upgrade Artemis 1.5.4.jbossorg-002
- Closed
-
JBEAP-10723 Upgrade Artemis 1.5.4.jbossorg-004
- Closed
-
JBEAP-12044 Upgrade Artemis 1.5.5.jbossorg-006
- Closed
- relates to
-
JBEAP-4736 (7.0.z) Live does not become active after failback in replicated topology with http connectors
- Resolved
-
ENTMQBR-556 Sync won't catch up on replication under load
- Closed