Uploaded image for project: 'AMQ Broker'
  1. AMQ Broker
  2. ENTMQBR-3587

Avoid notifications when shutting down on critical IO error

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Done
    • AMQ 7.6.0.GA
    • AMQ 7.8.0.CR1
    • broker-core
    • None
    • Hide

      Configure two brokers as master and slave that share a NFS shared store.

      • Start master
      • Start slave
      • on the machine serving the file system stop NFS service (systemctl stop nfs-server).
      • observe the master log file after about 10 minutes the following entries can be seen:
        2020-05-15 10:05:03,929 INFO  [org.apache.activemq.artemis] AMQ241004: Artemis Console available at http://rex:8161/console
        2020-05-15 10:14:02,224 WARN  [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: Input/output error
        	at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) [rt.jar:1.8.0_252]
        	at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:94) [rt.jar:1.8.0_252]
        	at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1114) [rt.jar:1.8.0_252]
        	at java.nio.channels.FileChannel.tryLock(FileChannel.java:1155) [rt.jar:1.8.0_252]
        	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.tryLock(FileLockNodeManager.java:351) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:370) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.getState(FileLockNodeManager.java:316) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.access$400(FileLockNodeManager.java:41) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager$MonitorLock.run(FileLockNodeManager.java:505) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:306) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$3.run(ActiveMQScheduledComponent.java:316) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_252]
        	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_252]
        	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_252]
        	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [rt.jar:1.8.0_252]
        	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_252]
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_252]
        	at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        
        2020-05-15 10:20:04,912 WARN  [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=NULL, message=IO Error while calculating disk usage: java.nio.file.FileSystemException: /mnt/shared-store/artemis/rex/master/data/paging: Input/output error
        	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) [rt.jar:1.8.0_252]
        	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) [rt.jar:1.8.0_252]
        	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) [rt.jar:1.8.0_252]
        	at sun.nio.fs.UnixFileStore.readAttributes(UnixFileStore.java:111) [rt.jar:1.8.0_252]
        	at sun.nio.fs.UnixFileStore.getUsableSpace(UnixFileStore.java:124) [rt.jar:1.8.0_252]
        	at org.apache.activemq.artemis.core.server.files.FileStoreMonitor.tick(FileStoreMonitor.java:104) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at org.apache.activemq.artemis.core.server.files.FileStoreMonitor.run(FileStoreMonitor.java:93) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:306) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_252]
        	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_252]
        	at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
        
        2020-05-15 10:23:01,801 WARN  [org.apache.activemq.artemis.utils.critical.CriticalMeasure] Component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager is expired on path 0
        2020-05-15 10:23:01,802 ERROR [org.apache.activemq.artemis.core.server] AMQ224079: The process for the virtual machine will be killed, as component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager@15043a2f is not responsive
        2020-05-15 10:23:01,833 WARN  [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump: *******************************************************************************
        Complete Thread dump
        "Thread-2 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$7@1a18644)" Id=76 RUNNABLE (in native)
        	at sun.nio.ch.FileDispatcherImpl.write0(Native Method)
        	at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
        	at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
        	at sun.nio.ch.IOUtil.write(IOUtil.java:51)
        
      • after that the process becomes unresponsive and any attempt to shut it down or kill -9 has no effect.

      broker.xml files for master and slave used in reproduction of this issue are attached.

      Show
      Configure two brokers as master and slave that share a NFS shared store. Start master Start slave on the machine serving the file system stop NFS service (systemctl stop nfs-server). observe the master log file after about 10 minutes the following entries can be seen: 2020-05-15 10:05:03,929 INFO [org.apache.activemq.artemis] AMQ241004: Artemis Console available at http://rex:8161/console 2020-05-15 10:14:02,224 WARN [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: Input/output error at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) [rt.jar:1.8.0_252] at sun.nio.ch.FileDispatcherImpl.lock(FileDispatcherImpl.java:94) [rt.jar:1.8.0_252] at sun.nio.ch.FileChannelImpl.tryLock(FileChannelImpl.java:1114) [rt.jar:1.8.0_252] at java.nio.channels.FileChannel.tryLock(FileChannel.java:1155) [rt.jar:1.8.0_252] at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.tryLock(FileLockNodeManager.java:351) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:370) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.getState(FileLockNodeManager.java:316) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.access$400(FileLockNodeManager.java:41) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager$MonitorLock.run(FileLockNodeManager.java:505) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:306) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$3.run(ActiveMQScheduledComponent.java:316) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [rt.jar:1.8.0_252] at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) [rt.jar:1.8.0_252] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) [rt.jar:1.8.0_252] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) [rt.jar:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_252] at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] 2020-05-15 10:20:04,912 WARN [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=NULL, message=IO Error while calculating disk usage: java.nio.file.FileSystemException: /mnt/shared-store/artemis/rex/master/data/paging: Input/output error at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) [rt.jar:1.8.0_252] at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) [rt.jar:1.8.0_252] at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) [rt.jar:1.8.0_252] at sun.nio.fs.UnixFileStore.readAttributes(UnixFileStore.java:111) [rt.jar:1.8.0_252] at sun.nio.fs.UnixFileStore.getUsableSpace(UnixFileStore.java:124) [rt.jar:1.8.0_252] at org.apache.activemq.artemis.core.server.files.FileStoreMonitor.tick(FileStoreMonitor.java:104) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at org.apache.activemq.artemis.core.server.files.FileStoreMonitor.run(FileStoreMonitor.java:93) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:306) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_252] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_252] at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005] 2020-05-15 10:23:01,801 WARN [org.apache.activemq.artemis.utils.critical.CriticalMeasure] Component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager is expired on path 0 2020-05-15 10:23:01,802 ERROR [org.apache.activemq.artemis.core.server] AMQ224079: The process for the virtual machine will be killed, as component org.apache.activemq.artemis.core.persistence.impl.journal.JournalStorageManager@15043a2f is not responsive 2020-05-15 10:23:01,833 WARN [org.apache.activemq.artemis.core.server] AMQ222199: Thread dump: ******************************************************************************* Complete Thread dump "Thread-2 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$7@1a18644)" Id=76 RUNNABLE (in native) at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) at sun.nio.ch.IOUtil.write(IOUtil.java:51) after that the process becomes unresponsive and any attempt to shut it down or kill -9 has no effect. broker.xml files for master and slave used in reproduction of this issue are attached.
    • Previously, when shutting itself down due to a critical IO error, the broker generated multiple notifications that triggered disk IO. These notifications could delay or event prevent shutdown of the broker. This issue is now resolved.
    • Documented as Resolved Issue
    • ?
    • Verified in a release

    Description

      A pair of master/slave brokers configured to use shared NFS store become <defunct> when the NFS shared store is removed i.e. NFS service is stoped on the machine serving the file system.

      The master broker prints a message that it is shutting down but the process never completely exits and leaves the JVM OS process hosting the broker in <defunct> state.

      2020-05-15 10:20:04,912 WARN  [org.apache.activemq.artemis.core.server] AMQ222010: Critical IO Error, shutting down the server. file=NULL, message=IO Error while calculating disk usage: java.nio.file.FileSystemException: /mnt/shared-store/artemis/rex/master/data/paging: Input/output error
      	at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) [rt.jar:1.8.0_252]
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) [rt.jar:1.8.0_252]
      	at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) [rt.jar:1.8.0_252]
      	at sun.nio.fs.UnixFileStore.readAttributes(UnixFileStore.java:111) [rt.jar:1.8.0_252]
      	at sun.nio.fs.UnixFileStore.getUsableSpace(UnixFileStore.java:124) [rt.jar:1.8.0_252]
      	at org.apache.activemq.artemis.core.server.files.FileStoreMonitor.tick(FileStoreMonitor.java:104) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
      	at org.apache.activemq.artemis.core.server.files.FileStoreMonitor.run(FileStoreMonitor.java:93) [artemis-server-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
      	at org.apache.activemq.artemis.core.server.ActiveMQScheduledComponent$2.run(ActiveMQScheduledComponent.java:306) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
      	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
      	at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
      	at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [rt.jar:1.8.0_252]
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_252]
      	at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118) [artemis-commons-2.11.0.redhat-00005.jar:2.11.0.redhat-00005]
      

      Any attempts to kill -9 the JVM process have no effect.

      [tomr@rex bin]$ kill -9 4265
      [tomr@rex bin]$ ps -fu tomr
      UID          PID    PPID  C STIME TTY          TIME CMD
      tomr        1284       1  0 09:18 ?        00:00:00 /usr/lib/systemd/systemd --user
      tomr        1293    1284  0 09:18 ?        00:00:00 (sd-pam)
      tomr        2128    2124  0 09:33 ?        00:00:00 sshd: tomr@pts/0
      tomr        2130    2128  0 09:33 pts/0    00:00:00 -zsh
      tomr        2171       1  0 09:33 ?        00:00:00 /usr/bin/ssh-agent
      tomr        2183    2180  0 09:34 ?        00:00:00 sshd: tomr@pts/1
      tomr        2184    2183  0 09:34 pts/1    00:00:00 -zsh
      tomr        2234    2231  0 09:34 ?        00:00:00 sshd: tomr@pts/2
      tomr        2235    2234  0 09:34 pts/2    00:00:00 -zsh
      tomr        2284    2281  0 09:34 ?        00:00:00 sshd: tomr@pts/3
      tomr        2285    2284  0 09:34 pts/3    00:00:00 -zsh
      tomr        2334    2331  0 09:34 ?        00:00:00 sshd: tomr@pts/4
      tomr        2336    2334  0 09:34 pts/4    00:00:00 -zsh
      tomr        2383    2381  0 09:34 ?        00:00:00 sshd: tomr@pts/5
      tomr        2385    2383  0 09:34 pts/5    00:00:00 -zsh
      tomr        4265    2130  1 10:04 pts/0    00:00:12 [java] <defunct>
      tomr        4391    2184  0 10:06 pts/1    00:00:00 /bin/sh /home/tomr/Qsync/bin/threadDump.sh -p 4265 -l 200 -i 10
      tomr        7193    4391  0 10:25 pts/1    00:00:00 /usr/bin/jstack -l 4265
      tomr        7215    2235  0 10:25 pts/2    00:00:00 ps -fu tomr
      
      Slave 
      
      [tomr@jess bin]$ kill -9 2566
      [tomr@jess bin]$ ps -fu tomr
      UID          PID    PPID  C STIME TTY          TIME CMD
      tomr        1846       1  0 09:33 ?        00:00:00 /usr/lib/systemd/systemd --user
      tomr        1848    1846  0 09:33 ?        00:00:00 (sd-pam)
      tomr        1873    1839  0 09:33 ?        00:00:00 sshd: tomr@pts/0
      tomr        1874    1873  0 09:33 pts/0    00:00:00 -zsh
      tomr        1913       1  0 09:33 ?        00:00:00 /usr/bin/ssh-agent
      tomr        1926    1921  0 09:33 ?        00:00:00 sshd: tomr@pts/1
      tomr        1927    1926  0 09:33 pts/1    00:00:00 -zsh
      tomr        1979    1975  0 09:33 ?        00:00:00 sshd: tomr@pts/2
      tomr        1980    1979  0 09:33 pts/2    00:00:00 -zsh
      tomr        2035    2032  0 09:33 ?        00:00:00 sshd: tomr@pts/3
      tomr        2036    2035  0 09:33 pts/3    00:00:00 -zsh
      tomr        2087    2084  0 09:33 ?        00:00:00 sshd: tomr@pts/4
      tomr        2088    2087  0 09:33 pts/4    00:00:00 -zsh
      tomr        2141    2136  0 09:33 ?        00:00:00 sshd: tomr@pts/5
      tomr        2142    2141  0 09:33 pts/5    00:00:00 -zsh
      tomr        2566    1874  0 10:05 pts/0    00:00:10 [java] <defunct>
      tomr        2677    1927  0 10:06 pts/1    00:00:00 /bin/sh /home/tomr/Qsync/bin/threadDump.sh -p 2566 -l 200 -i 10
      tomr        5376    2677  3 10:26 pts/1    00:00:00 /usr/bin/jstack -l 2566
      tomr        5397    1980  0 10:27 pts/2    00:00:00 ps -fu tomr
      

      Broker's thread dumps for master and slave are attached below

      Attachments

        1. broker-master.xml
          13 kB
        2. broker-slave.xml
          13 kB
        3. README
          0.9 kB
        4. second-run-master.zip
          507 kB
        5. second-run-slave.zip
          518 kB

        Issue Links

          Activity

            People

              rhn-support-jbertram Justin Bertram
              rhn-support-toross Tom Ross
              Tiago Bueno Tiago Bueno
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: