  1. AMQ Broker
  2. ENTMQBR-4291

JMS client unable to reconnect after master freeze


    • Type: Bug
    • Resolution: Not a Bug
    • Priority: Critical
    • None
    • AMQ 7.7.0.CR4
    • core-jms-client
    • None
    • False
    • False
    • Undefined

      I'm attaching the broker configuration files and my client's trace logs. Note that I'm using SIGSTOP/SIGCONT to simulate the freeze, but the end result is the same as in the customer's case (where the freeze is caused by taking a heap dump on the master).

      # master (host0 61616), slave (host1 61617)
      (tcp://127.0.0.1:61616,tcp://127.0.0.2:61617)?ha=true&retryInterval=1000&retryIntervalMultiplier=1.0&reconnectAttempts=-1
      
      # start master and slave brokers
      rm -rf data log && bin/artemis-service start && tail -f log/artemis.log
      
      # start the consumer application
      mvn clean compile exec:java -Pcon
      
      # pause master's JVM process
      PID=$(ps -e | grep [h]ost0 | awk '{print $1}'); kill -SIGSTOP $PID
      
      # wait for consumer to failover to slave
      
      # release master's JVM process
      PID=$(ps -e | grep [h]ost0 | awk '{print $1}'); kill -SIGCONT $PID
      
      # at this point we have a split-brain as there is no quorum to mitigate
      
      # stop the slave process
      
      # consumer fails back to master (CheckpointA)
      
      # restart the master
      
      # consumer is able to reconnect without restart (CheckpointB)
      

      In my tests, CheckpointA fails right after logging "Reconnection successful", with the exception "AMQ219013: Timed out waiting to receive cluster topology. Group:null". CheckpointB also fails, but no exception is logged: the consumer logs "Reconnection successful", but no message is received.
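To make the reconnect settings in the connection URL above concrete, here is a minimal sketch that parses the URL's query parameters and computes the delay before each retry. `RetrySchedule`, `parseParams`, and `delayMillis` are hypothetical helpers, not part of the Artemis client API. The backoff formula (interval times multiplier per attempt, with a cap) and the 2000 ms cap are assumptions made for illustration, not taken from this report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Artemis client API.
final class RetrySchedule {

    // Extracts the query parameters that follow '?' in a connection URL.
    static Map<String, String> parseParams(String url) {
        Map<String, String> params = new LinkedHashMap<>();
        int q = url.indexOf('?');
        if (q < 0) {
            return params;
        }
        for (String pair : url.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0) {
                params.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
        }
        return params;
    }

    // Delay before retry number `attempt` (0-based).
    // Assumption: geometric growth by `multiplier`, capped at `maxIntervalMs`.
    static long delayMillis(long retryIntervalMs, double multiplier, long maxIntervalMs, int attempt) {
        double delay = retryIntervalMs * Math.pow(multiplier, attempt);
        return (long) Math.min(delay, (double) maxIntervalMs);
    }

    public static void main(String[] args) {
        String url = "(tcp://127.0.0.1:61616,tcp://127.0.0.2:61617)"
                + "?ha=true&retryInterval=1000&retryIntervalMultiplier=1.0&reconnectAttempts=-1";
        Map<String, String> p = parseParams(url);
        long interval = Long.parseLong(p.get("retryInterval"));
        double multiplier = Double.parseDouble(p.get("retryIntervalMultiplier"));
        // 2000 ms is an assumed cap, used here only to bound the example.
        for (int attempt = 0; attempt < 3; attempt++) {
            System.out.println("attempt " + attempt + ": "
                    + delayMillis(interval, multiplier, 2000L, attempt) + " ms");
        }
    }
}
```

With `retryIntervalMultiplier=1.0` the delay stays constant at the `retryInterval` value, and `reconnectAttempts=-1` means the client keeps retrying without limit, so under these assumptions the consumer in this test should retry once per second indefinitely.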

      When the master broker "freezes" (caused by a heap dump) in a single-node master-slave replication setup, clients fail over to the slave broker as expected. When the master "unfreezes", we have a split-brain, which is also expected. At this point, if we stop the slave broker, the JMS clients try to fail back to the master broker after N retries. They are able to connect, but soon afterwards they all fail with the following error (every new connection gets the same exception):

      javax.jms.JMSException: Failed to create session factory
      	at org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory.createConnectionInternal(ActiveMQConnectionFactory.java:886) ~[artemis-jms-client-2.13.0.redhat-00006.jar:2.13.0.redhat-00006]
      	at org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory.createConnection(ActiveMQConnectionFactory.java:299) ~[artemis-jms-client-2.13.0.redhat-00006.jar:2.13.0.redhat-00006]
      	at it.fvaleri.integ.ApplicationUtil.openConnection(ApplicationUtil.java:58) ~[classes/:?]
      	at it.fvaleri.integ.Application.<init>(Application.java:15) [classes/:?]
      	at it.fvaleri.integ.Application.main(Application.java:31) [classes/:?]
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_252]
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_252]
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_252]
      	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_252]
      	at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282) [exec-maven-plugin-1.6.0.jar:?]
      	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_252]
      Caused by: org.apache.activemq.artemis.api.core.ActiveMQConnectionTimedOutException: AMQ219013: Timed out waiting to receive cluster topology. Group:null
      	at org.apache.activemq.artemis.core.client.impl.ServerLocatorImpl.createSessionFactory(ServerLocatorImpl.java:712) ~[artemis-core-client-2.13.0.redhat-00006.jar:2.13.0.redhat-00006]
      	at org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory.createConnectionInternal(ActiveMQConnectionFactory.java:884) ~[artemis-jms-client-2.13.0.redhat-00006.jar:2.13.0.redhat-00006]
      	... 10 more
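The trace above shows the AMQ219013 timeout as the cause of a generic "Failed to create session factory" exception. A client that wants to tell this topology timeout apart from other connection failures could walk the cause chain and look for the error code. The sketch below uses only plain `Throwable` so that it runs without the client libraries. `TopologyTimeoutCheck` and `hasErrorCode` are hypothetical names, and matching on the message prefix is an assumption based on the message format seen in this trace.

```java
// Hypothetical helper for illustration only; not part of the Artemis client API.
final class TopologyTimeoutCheck {

    // Returns true if any exception in the cause chain has a message starting with `code`.
    // The depth limit guards against cyclic cause chains.
    static boolean hasErrorCode(Throwable error, String code) {
        int depth = 0;
        for (Throwable t = error; t != null && depth < 32; t = t.getCause()) {
            String message = t.getMessage();
            if (message != null && message.startsWith(code)) {
                return true;
            }
            depth++;
        }
        return false;
    }

    public static void main(String[] args) {
        // Stand-ins for the two exceptions in the trace, built from plain Exception.
        Exception cause = new Exception(
                "AMQ219013: Timed out waiting to receive cluster topology. Group:null");
        Exception top = new Exception("Failed to create session factory", cause);
        System.out.println(hasErrorCode(top, "AMQ219013"));
    }
}
```

In a real client the top-level exception would be the `javax.jms.JMSException` thrown by `createConnection`, and the check could decide whether to retry or to recreate the connection factory. That policy is outside what this report establishes.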
      

        1. broker.xml.host0
          6 kB
        2. broker.xml.host1
          6 kB
        3. client.log
          18 kB
        4. jms-client.tar.gz
          6 kB

              rhn-support-jbertram Justin Bertram
              rhn-support-fvaleri Federico Valeri
              Votes: 0
              Watchers: 3
