JBoss Enterprise Application Platform
JBEAP-7968

(7.1.0) The backup server is not responding promptly introducing latency beyond the limit.


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • Fix Version: 7.1.0.ER3
    • Affects Versions: 7.1.0.DR9, 7.1.0.DR11, 7.1.0.DR12, 7.1.0.DR14, 7.1.0.DR15, 7.1.0.DR16, 7.1.0.DR18, 7.1.0.DR19, 7.1.0.ER1
    • Component: ActiveMQ
    • Labels: Blocks Testing
    • Steps to reproduce:
      git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
      cd eap-tests-hornetq/scripts/
      git checkout 50db1a0dcc9eb6c6876c0254a4ce45d569a78ff3
      groovy -DEAP_VERSION=7.1.0.DR19 PrepareServers7.groovy
      export WORKSPACE=$PWD
      export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
      export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
      export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
      export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
      
      cd ../jboss-hornetq-testsuite/
      
      mvn clean test -Dtest=ReplicatedDedicatedFailoverTestCase#testFailbackClientAckTopic -DfailIfNoTests=false -Deap=7x -Deap7.org.jboss.qa.hornetq.apps.clients.version=7.1.0.DR19 | tee log
      
    • AMQ Sprint 3

    Description

      In replicated HA scenarios I can see that replication breaks with the warning [1].

      This issue was already discussed in JBEAP-4742 (see the comments there). As a solution, the timeout was made configurable via the call-timeout attribute on the cluster-connection.
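      For reference, the attribute can be set from the management CLI. A minimal sketch, assuming the cluster connection is named "my-cluster" as in the default EAP 7 full-ha profile (adjust the name and value to your configuration):

```shell
# Raise call-timeout on the cluster connection (value in milliseconds;
# the default is 30000). "my-cluster" is the connection name in the
# default EAP 7 full-ha profile -- adjust to your setup.
$JBOSS_HOME_1/bin/jboss-cli.sh --connect <<'EOF'
/subsystem=messaging-activemq/server=default/cluster-connection=my-cluster:write-attribute(name=call-timeout, value=60000)
reload
EOF
```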

      I had seen this issue in our CI but suspected it was an environment problem caused by slow NFS. However, I dug into it a bit more; here are my findings.

      It seems that something hangs the synchronization process, because increasing call-timeout does not help.

      I have tracked the sending and receiving of synchronization packets in trace logs. There is a 60-second window in which no packet is sent or handled. The hanging packets are received only after [1] is printed to the log and replication has been canceled.
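      The silent window can be located mechanically. A rough sketch, assuming each relevant trace line begins with the HH:MM:SS,mmm timestamp shown in [1] (midnight rollover is not handled):

```shell
# find_gaps LOG LIMIT: print each timestamped line that is preceded by a
# silence longer than LIMIT seconds. Assumes lines start with HH:MM:SS,mmm.
find_gaps() {
  awk -v limit="$2" '
    $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$/ {
      split($1, t, /[:,]/)
      now = t[1] * 3600 + t[2] * 60 + t[3] + t[4] / 1000
      if (seen && now - prev > limit)
        printf "gap of %.1f s before: %s\n", now - prev, $0
      prev = now
      seen = 1
    }' "$1"
}
```

      Running this over both servers' trace logs with a limit of, say, 50 seconds should point directly at the window in which no replication packet moved.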

      When I set call-timeout to 2 minutes, replication still fails, this time with a connection timeout error.

      I can easily reproduce the issue in our CI, but not locally on my laptop. There may be a race condition that manifests only in a slower environment.

      I can see the same issue with 7.0.x.

      Tip for debugging: on both servers there is a single thread that takes care of sending and handling replication packets. You can track these threads in the trace logs; see the attachment.

      [1]

      10:43:58,180 WARN  [org.apache.activemq.artemis.core.server] (Thread-131) AMQ222207: The backup server is not responding promptly introducing latency beyond the limit. Replication server being disconnected now.
      

      Customer impact: Replication between the live and backup server may fail, and it is not restored automatically. This can happen during the initial synchronization from live to backup, when the backup is started for the first time or after failback, and so may be hit while a user or customer is running a proof of concept. An admin has to identify this situation and restart the server acting as backup; otherwise the backup will not activate if the live server crashes, leading to unavailability of the service.
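      Until this is fixed, the broken state is at least detectable: the disconnect is accompanied by warning AMQ222207 in the server log, as in [1]. A minimal detection sketch (the log path and the restart command in the comment are assumptions about the installation):

```shell
# replication_broken LOG: succeed if the log contains the AMQ222207
# warning from [1], i.e. replication was disconnected at some point.
replication_broken() {
  grep -q 'AMQ222207' "$1"
}

# A remediation loop an admin might run until the bug is fixed (paths and
# the restart mechanism are assumptions about the installation):
#   while sleep 60; do
#     replication_broken "$JBOSS_HOME_2/standalone/log/server.log" &&
#       "$JBOSS_HOME_2/bin/jboss-cli.sh" --connect ':shutdown(restart=true)'
#   done
```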

      Attachments

        1. node-2-thread-dump-11-06-11.txt
          166 kB
          Erich Duda
        2. packet-analysis.png
          11.15 MB
          Erich Duda

        Issue Links

          Activity

            People

              csuconic@redhat.com Clebert Suconic
              eduda_jira Erich Duda (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: