Details
- Bug
- Resolution: Done
- Blocker
- 7.1.0.DR9, 7.1.0.DR11, 7.1.0.DR12, 7.1.0.DR14, 7.1.0.DR15, 7.1.0.DR16, 7.1.0.DR18, 7.1.0.DR19, 7.1.0.ER1
- AMQ Sprint 3
Description
In replicated HA scenarios, replication breaks with the warning shown in [1].
This issue was already discussed in JBEAP-4742 (see the comments there). As a solution, the timeout was made configurable: you can set it using call-timeout in cluster-connection.
I have seen this issue in our CI and initially suspected an environment problem caused by slow NFS. However, I dug into it a bit more. Here are my findings.
Something seems to hang the synchronization process, because increasing call-timeout doesn't help.
I tracked the sending and receiving of synchronization packets in the trace logs. There is a 60s window in which no packet is sent or handled. The pending packets are received only after [1] is printed to the log and replication is canceled.
When I set call-timeout to 2 minutes, replication fails with a connection timeout error instead.
I can easily reproduce the issue in our CI, but I can't reproduce it locally on my laptop. Maybe there is a race condition that shows up only in slower environments.
I can see the same issue with 7.0.x.
Tip for debugging: on both servers there is one thread that takes care of sending/handling replication packets. You can track these threads in the trace logs (see attachment).
[1]
10:43:58,180 WARN [org.apache.activemq.artemis.core.server] (Thread-131) AMQ222207: The backup server is not responding promptly introducing latency beyond the limit. Replication server being disconnected now.
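The thread-tracking tip above can be scripted. The following is a minimal sketch, not part of the original report, that scans a log for lines written by a single thread and reports windows in which that thread logged nothing. It assumes the trace log uses the same line layout as the sample in [1] (time, level, logger in brackets, thread name in parentheses); the function name `find_gaps` and the 30-second default threshold are illustrative choices.

```python
import re
from datetime import datetime, timedelta

# Assumed line layout, taken from the sample in [1]:
# "10:43:58,180 WARN [logger.name] (Thread-131) message"
LINE_RE = re.compile(r"^(\d{2}:\d{2}:\d{2},\d{3})\s+\w+\s+\[[^\]]*\]\s+(.*)$")


def find_gaps(lines, thread, min_gap_seconds=30.0):
    """Return (start, end, seconds) for each silent window of one thread.

    `thread` is the thread name exactly as printed between the outer
    parentheses of the log line, e.g. "Thread-131".
    """
    marker = "(" + thread + ") "
    gaps = []
    previous = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match or not match.group(2).startswith(marker):
            continue  # not a log line, or written by another thread
        current = datetime.strptime(match.group(1), "%H:%M:%S,%f")
        if previous is not None:
            delta = current - previous
            if delta < timedelta(0):
                # The log has no date, so treat a backwards jump as midnight rollover.
                delta += timedelta(days=1)
            if delta.total_seconds() >= min_gap_seconds:
                gaps.append((previous.time(), current.time(), delta.total_seconds()))
        previous = current
    return gaps
```

For example, `find_gaps(open("server.log"), "Thread-131")` would list every window of 30 seconds or more in which Thread-131 wrote nothing; the file name is a placeholder for whichever trace log is being inspected. Running it against both the live and the backup log makes it easier to see which side stops handling packets first.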
Customer impact: Replication between Live and Backup may fail and is not restored automatically. This can happen during the initial live->backup synchronization, either when the backup is started for the first time or after failback. A user or customer may hit this when running a proof of concept. An admin has to identify the situation and restart the server acting as Backup. Backup will not activate if the Live server crashes, which leads to unavailability of the service.
Issue Links
- is blocked by
  - JBEAP-10030 Upgrade Artemis 1.5.4.jbossorg-002 (Verified)
  - JBEAP-10723 Upgrade Artemis 1.5.4.jbossorg-004 (Verified)
  - JBEAP-12044 Upgrade Artemis 1.5.5.jbossorg-006 (Verified)
- is cloned by
  - JBEAP-8630 (7.0.z) The backup server is not responding promptly introducing latency beyond the limit. (Resolved)
- is incorporated by
  - JBEAP-10030 Upgrade Artemis 1.5.4.jbossorg-002 (Verified)
  - JBEAP-10723 Upgrade Artemis 1.5.4.jbossorg-004 (Verified)
  - JBEAP-12044 Upgrade Artemis 1.5.5.jbossorg-006 (Verified)
- relates to
  - JBEAP-4736 (7.0.z) Live does not become active after failback in replicated topology with http connectors (Resolved)
  - ENTMQBR-556 Sync won't catch up on replication under load (Closed)