JBoss Enterprise Application Platform
JBEAP-7968

(7.1.0) The backup server is not responding promptly introducing latency beyond the limit.


Details

    • Type: Bug
    • Resolution: Done
    • Priority: Blocker
    • Fix Version: 7.1.0.ER3
    • Affects Versions: 7.1.0.DR9, 7.1.0.DR11, 7.1.0.DR12, 7.1.0.DR14, 7.1.0.DR15, 7.1.0.DR16, 7.1.0.DR18, 7.1.0.DR19, 7.1.0.ER1
    • Component: ActiveMQ
    • Labels: Blocks Testing
    • Steps to reproduce:
      git clone git://git.app.eng.bos.redhat.com/jbossqe/eap-tests-hornetq.git
      cd eap-tests-hornetq/scripts/
      git checkout 50db1a0dcc9eb6c6876c0254a4ce45d569a78ff3
      groovy -DEAP_VERSION=7.1.0.DR19 PrepareServers7.groovy
      export WORKSPACE=$PWD
      export JBOSS_HOME_1=$WORKSPACE/server1/jboss-eap
      export JBOSS_HOME_2=$WORKSPACE/server2/jboss-eap
      export JBOSS_HOME_3=$WORKSPACE/server3/jboss-eap
      export JBOSS_HOME_4=$WORKSPACE/server4/jboss-eap
      
      cd ../jboss-hornetq-testsuite/
      
      mvn clean test -Dtest=ReplicatedDedicatedFailoverTestCase#testFailbackClientAckTopic -DfailIfNoTests=false -Deap=7x -Deap7.org.jboss.qa.hornetq.apps.clients.version=7.1.0.DR19 | tee log
      
    • AMQ Sprint 3

    Description

      In replicated HA scenarios I can see that replication breaks with the warning [1].

      This issue was already discussed in JBEAP-4742 (see the comments there). As a solution, the timeout was made configurable via the call-timeout attribute on the cluster-connection.
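      For reference, the attribute can be set from the management CLI. A minimal sketch, assuming the cluster connection is named "my-cluster" as in the default EAP 7 full-ha profile (adjust the name and value to your configuration):

```shell
# Raise call-timeout on the cluster connection (value in milliseconds;
# the default is 30000). "my-cluster" is the connection name in the
# default EAP 7 full-ha profile -- adjust to your setup.
$JBOSS_HOME_1/bin/jboss-cli.sh --connect <<'EOF'
/subsystem=messaging-activemq/server=default/cluster-connection=my-cluster:write-attribute(name=call-timeout, value=60000)
reload
EOF
```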

      I had seen this issue in our CI but suspected it was an environment problem caused by slow NFS. However, I dug into it a bit more; here are my findings.

      It seems that something hangs the synchronization process, because increasing call-timeout does not help.

      I have tracked the sending and receiving of synchronization packets in trace logs. There is a 60-second window in which no packet is sent or handled. The hanging packets are received only after [1] is printed to the log and replication has been canceled.
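      The silent window can be located mechanically. A rough sketch, assuming each relevant trace line begins with the HH:MM:SS,mmm timestamp shown in [1] (midnight rollover is not handled):

```shell
# find_gaps LOG LIMIT: print each timestamped line that is preceded by a
# silence longer than LIMIT seconds. Assumes lines start with HH:MM:SS,mmm.
find_gaps() {
  awk -v limit="$2" '
    $1 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$/ {
      split($1, t, /[:,]/)
      now = t[1] * 3600 + t[2] * 60 + t[3] + t[4] / 1000
      if (seen && now - prev > limit)
        printf "gap of %.1f s before: %s\n", now - prev, $0
      prev = now
      seen = 1
    }' "$1"
}
```

      Running this over both servers' trace logs with a limit of, say, 50 seconds should point directly at the window in which no replication packet moved.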

      When I set call-timeout to 2 minutes, replication still fails, this time with a connection timeout error.

      I can easily reproduce the issue in our CI, but not locally on my laptop. There may be a race condition that manifests only in a slower environment.

      I can see the same issue with 7.0.x.

      Tip for debugging: on both servers there is a single thread that takes care of sending and handling replication packets. You can track these threads in the trace logs; see the attachment.

      [1]

      10:43:58,180 WARN  [org.apache.activemq.artemis.core.server] (Thread-131) AMQ222207: The backup server is not responding promptly introducing latency beyond the limit. Replication server being disconnected now.
      

      Customer impact: Replication between the live and backup server may fail, and it is not restored automatically. This can happen during the initial synchronization from live to backup, when the backup is started for the first time or after failback, and so may be hit while a user or customer is running a proof of concept. An admin has to identify this situation and restart the server acting as backup; otherwise the backup will not activate if the live server crashes, leading to unavailability of the service.
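      Until this is fixed, the broken state is at least detectable: the disconnect is accompanied by warning AMQ222207 in the server log, as in [1]. A minimal detection sketch (the log path and the restart command in the comment are assumptions about the installation):

```shell
# replication_broken LOG: succeed if the log contains the AMQ222207
# warning from [1], i.e. replication was disconnected at some point.
replication_broken() {
  grep -q 'AMQ222207' "$1"
}

# A remediation loop an admin might run until the bug is fixed (paths and
# the restart mechanism are assumptions about the installation):
#   while sleep 60; do
#     replication_broken "$JBOSS_HOME_2/standalone/log/server.log" &&
#       "$JBOSS_HOME_2/bin/jboss-cli.sh" --connect ':shutdown(restart=true)'
#   done
```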

      Attachments

        1. node-2-thread-dump-11-06-11.txt
          166 kB
          Erich Duda
        2. packet-analysis.png
          11.15 MB
          Erich Duda

        Issue Links

          Activity

            People

              csuconic@redhat.com Clebert Suconic
              eduda_jira Erich Duda (Inactive)
              Votes: 0
              Watchers: 10

              Dates

                Created:
                Updated:
                Resolved: