  1. JBoss Enterprise Application Platform
  2. JBEAP-7968

(7.1.0) The backup server is not responding promptly introducing latency beyond the limit.

    • Bug
    • Resolution: Done
    • Blocker
    • 7.1.0.ER3
    • 7.1.0.DR9, 7.1.0.DR11, 7.1.0.DR12, 7.1.0.DR14, 7.1.0.DR15, 7.1.0.DR16, 7.1.0.DR18, 7.1.0.DR19, 7.1.0.ER1
    • ActiveMQ
    • Blocks Testing
    • Steps to reproduce:
      git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
      cd eap-tests-hornetq/scripts/
      git checkout 50db1a0dcc9eb6c6876c0254a4ce45d569a78ff3
      groovy -DEAP_VERSION=7.1.0.DR19 PrepareServers7.groovy
      export WORKSPACE=$PWD
      export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
      export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
      export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
      export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
      
      cd ../jboss-hornetq-testsuite/
      
      mvn clean test -Dtest=ReplicatedDedicatedFailoverTestCase#testFailbackClientAckTopic -DfailIfNoTests=false -Deap=7x -Deap7.org.jboss.qa.hornetq.apps.clients.version=7.1.0.DR19 | tee log
      
    • AMQ Sprint 3

      In replicated HA scenarios, replication breaks with the warning shown in [1].

      This issue was already discussed in JBEAP-4742 (see the comments there). As a solution, the timeout was made configurable: it can be set with the call-timeout attribute of cluster-connection.
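
      As an illustration of how call-timeout could be raised, here is a minimal sketch using the management CLI. It assumes a running server reachable on the default management address, a messaging server named "default", and a cluster connection named "my-cluster"; these names are assumptions and not taken from this report, so adjust them to the actual configuration. The value is assumed to be in milliseconds (120000 for the 2-minute setting mentioned below).

      ```shell
      # Sketch only: the server name "default" and the cluster connection name
      # "my-cluster" are assumptions; substitute the names from your profile.
      # Sets call-timeout to 120000 (assumed milliseconds, i.e. 2 minutes).
      "$JBOSS_HOME_1/bin/jboss-cli.sh" --connect \
        --command='/subsystem=messaging-activemq/server=default/cluster-connection=my-cluster:write-attribute(name=call-timeout,value=120000)'
      ```

      The server may need a reload before the new value takes effect. As described in this report, raising the timeout did not make replication succeed; it only changed how the failure shows up.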

      I had seen this issue in our CI and suspected an environment problem caused by slow NFS. However, I dug into it further; here are my findings.

      Something appears to hang the synchronization process, because increasing call-timeout doesn't help.

      I tracked the sending and receiving of synchronization packets in the trace logs. There is a 60 s window in which no packet is sent or handled. The stalled packets are received only after [1] is printed to the log and replication is canceled.

      When I set call-timeout to 2 minutes, replication fails with a connection timeout error instead.

      I can easily reproduce the issue in our CI, but I can't reproduce it locally on my laptop. There may be a race condition that shows up only in slower environments.

      I can see the same issue with 7.0.x.

      Debugging tip: on each server, a single thread takes care of sending/handling replication packets. You can track these threads in the trace logs; see the attachments.
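
      One way to spot a stall like the 60 s window described above is to filter the trace log by thread name and report long gaps between consecutive lines. The helper below is a hypothetical sketch, not part of the attached analyzer. It assumes each log line starts with an HH:MM:SS,mmm timestamp and contains the thread name in parentheses, as in [1], and it does not handle logs that cross midnight.

      ```shell
      # Hypothetical helper: report gaps of at least <min-gap-seconds> between
      # consecutive log lines written by one thread.
      # Usage: find_gaps <thread-name> <min-gap-seconds> <log-file>
      find_gaps() {
          thread="$1"; min_gap="$2"; file="$3"
          # Keep only lines logged by the given thread, e.g. "(Thread-131)".
          grep -F "($thread)" "$file" | awk -v min="$min_gap" '
          {
              # The first field is the timestamp, formatted HH:MM:SS,mmm.
              split($1, t, /[:,]/)
              now = t[1] * 3600 + t[2] * 60 + t[3] + t[4] / 1000
              if (NR > 1 && now - prev >= min)
                  printf "gap of %.3fs between %s and %s\n", now - prev, prev_ts, $1
              prev = now; prev_ts = $1
          }'
      }
      ```

      For example, `find_gaps Thread-131 30 server.log` would list every pause of 30 seconds or more on that thread, where server.log stands for whichever trace log is being inspected.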

      [1]

      10:43:58,180 WARN  [org.apache.activemq.artemis.core.server] (Thread-131) AMQ222207: The backup server is not responding promptly introducing latency beyond the limit. Replication server being disconnected now.
      

      Customer impact: Replication between Live and Backup may fail and is not restored automatically. This can happen during the initial live->backup synchronization, either when the backup is started for the first time or after failback. A user or customer can hit it while running a proof of concept. The admin has to identify the situation and restart the server acting as Backup. Until then, the Backup will not activate if the Live server crashes, which leads to unavailability of the service.

        1. analyzer.zip
          4 kB
        2. packet-analysis.png
          11.15 MB
        3. logs.zip
          5.75 MB
        4. node-2-thread-dump-11-06-11.txt
          166 kB

            csuconic@redhat.com Clebert Suconic
            eduda_jira Erich Duda (Inactive)
            Votes: 0
            Watchers: 11