  1. AMQ Broker
  2. ENTMQBR-4412

Master/slave brokers both become unresponsive on NFS-side HA failover


Details

    Description

      When an HA failover is performed from one NFS server to another (which should be transparent to the application), both AMQ brokers become unresponsive. The broker processes stay alive, but consumers cannot connect.

      The problem is intermittent; it does not occur on every failover event.

      Logs
      =========
      
      The logs show that both brokers hit an I/O error within about a second of each other: the slave at 16:04:24 and the master at 16:04:25.
      
      Master
      -----------
      //NOTE: logs only go back to 16:00
      $ egrep -r "Shutting|Starting|ERROR|WARN" amq01-logs | egrep -v "AMQ222061|AMQ224016|AMQ222107|AMQ212037" 
      
      amq01-logs/artemis.log.2:2020-12-21 16:04:25,224 WARN [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=NULL, message=IO Error while calculating disk usage: java.nio.file.FileSystemException: /amqdata/amq7_broker_data/paging: Input/output error
      
      
      Slave
      -----------
      $ egrep -r "Shutting|Starting|ERROR|WARN" amq02-logs | egrep -v "AMQ222061|AMQ224016|AMQ222107|AMQ212037"
      
      amq02-logs/artemis.log.5:2020-12-21 16:04:24,092 WARN [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: Input/output error
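
To check how closely the two errors coincide, the timestamps can be pulled out of the filtered lines and compared. The following is a minimal sketch, assuming the `YYYY-MM-DD HH:MM:SS,mmm` timestamp layout seen in the lines quoted above. The helper name `first_timestamp` is illustrative and not part of any Artemis tooling, and the sample strings are copied (the master line truncated) from the output above.

```python
import re
from datetime import datetime

# Timestamp layout as it appears in the artemis.log lines quoted above.
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")

def first_timestamp(line):
    """Return the first log timestamp found in a grep output line, or None."""
    match = TS_RE.search(line)
    if match is None:
        return None
    return datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S,%f")

# Lines taken from the egrep output above (master line truncated after the message code).
master_line = ("amq01-logs/artemis.log.2:2020-12-21 16:04:25,224 WARN "
               "[org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error")
slave_line = ("amq02-logs/artemis.log.5:2020-12-21 16:04:24,092 WARN "
              "[org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] "
              "Failure when accessing a lock file: java.io.IOException: Input/output error")

delta = first_timestamp(master_line) - first_timestamp(slave_line)
print(f"master error follows slave error by {delta.total_seconds():.3f}s")
```

For the two lines above this reports a gap of roughly 1.1 seconds, which suggests both brokers saw the same NFS interruption.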
      
      
      Thread Dumps
      ===============
      
      The thread dumps, captured about 15 minutes later, show the following:
      
      Master
      ----------
      
      "Thread-5 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@46074492)" #34 prio=5 os_prio=0 cpu=265.58ms elapsed=774.83s tid=0x00007f476d086800 nid=0x17cb waiting on condition [0x00007f470fbf9000]
         java.lang.Thread.State: TIMED_WAITING (parking)
      at jdk.internal.misc.Unsafe.park(java.base@11.0.7/Native Method)
      - parking to wait for  <0x000000009dece8c8> (a java.util.concurrent.CountDownLatch$Sync)
      at java.util.concurrent.locks.LockSupport.parkNanos(java.base@11.0.7/LockSupport.java:234)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(java.base@11.0.7/AbstractQueuedSynchronizer.java:1079)
      at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(java.base@11.0.7/AbstractQueuedSynchronizer.java:1369)
      at java.util.concurrent.CountDownLatch.await(java.base@11.0.7/CountDownLatch.java:278)
      at org.apache.activemq.artemis.core.journal.impl.SimpleWaitIOCallback.waitCompletion(SimpleWaitIOCallback.java:61)
      at org.apache.activemq.artemis.core.journal.impl.JournalBase.appendCommitRecord(JournalBase.java:63)
      at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendCommitRecord(JournalImpl.java:91)
      at org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.commitBindings(AbstractJournalStorageManager.java:658)
      ...
      "Thread-10" #106 prio=5 os_prio=0 cpu=355.87ms elapsed=547.38s tid=0x00007f4734002000 nid=0x1bc0 waiting for monitor entry [0x00007f47033e3000]
         java.lang.Thread.State: BLOCKED (on object monitor)
      at org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.stop(PostOfficeImpl.java:198)
      - waiting to lock <0x000000008069d340> (a org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl)
      at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.stopComponent(ActiveMQServerImpl.java:1356)
      at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.stop(ActiveMQServerImpl.java:1170)
      at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl.stop(ActiveMQServerImpl.java:1051)
      at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5.run(ActiveMQServerImpl.java:857)
         Locked ownable synchronizers:
      - None
      
      Slave
      ----------
      
      "AMQ229000: Activation for server ActiveMQServerImpl::serverUUID=ee4ebeb8-2391-11eb-9967-0021f64befb7" #15 prio=5 os_prio=0 cpu=317.50ms elapsed=685.73s tid=0x00007f39c8f48000 nid=0x4deb waiting on condition [0x00007f399cae5000]
         java.lang.Thread.State: TIMED_WAITING (sleeping)
              at java.lang.Thread.sleep(java.base@11.0.6/Native Method)
              at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:183)
              at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77)
              at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:3730)
      
         Locked ownable synchronizers:
              - None
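
When triaging dumps like these, it can help to list each thread with its `java.lang.Thread.State` so that blocked threads stand out. The following is a rough sketch, assuming jstack-style text in which a thread entry begins with a quoted name and contains a `java.lang.Thread.State:` line. The function name `summarize_threads` is hypothetical, and the sample is a trimmed excerpt of the master dump above (stack frames and most header fields omitted).

```python
import re

STATE_RE = re.compile(r"java\.lang\.Thread\.State: (\w+)")

def summarize_threads(dump_text):
    """Map each thread name in a jstack-style dump to its Thread.State."""
    states = {}
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        if line.startswith('"'):
            # Thread header: the name is the text between the first pair of quotes.
            current = line.split('"')[1]
            states[current] = "UNKNOWN"
        elif current is not None:
            match = STATE_RE.search(line)
            if match:
                states[current] = match.group(1)
    return states

# Trimmed excerpt of the master dump above; frames and most header fields omitted.
sample = '''
"Thread-5 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@46074492)" #34 prio=5
   java.lang.Thread.State: TIMED_WAITING (parking)
"Thread-10" #106 prio=5
   java.lang.Thread.State: BLOCKED (on object monitor)
'''

states = summarize_threads(sample)
blocked = [name for name, state in states.items() if state == "BLOCKED"]
print("blocked threads:", blocked)
```

On the excerpt this singles out `Thread-10`, the thread that is waiting to lock `PostOfficeImpl` inside `ActiveMQServerImpl.stop`.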
      

            People

              fnigro Francesco Nigro
              rhn-support-shiggs Stephen Higgs
              Tiago Bueno Tiago Bueno