Uploaded image for project: 'WildFly'
  1. WildFly
  2. WFLY-11404

Artemis throws Critical IO Error if new journal file is not created in 5 seconds

    • Icon: Enhancement Enhancement
    • Resolution: Done
    • Icon: Critical Critical
    • 17.0.0.Alpha1, 17.0.0.Final
    • 15.0.0.Beta1
    • JMS
    • None

      I can see in our CI jobs that Artemis sometimes stops because of error [1]. I looked at the code [2] where the exception is thrown and I think it could be improved a bit.

      Customer Impact: If Artemis journal is located on slower file system (like NFS) then if server is under load then it might crash. This will lead to unavailability of service. Server must be restarted to recover.

      First thing I noticed is that the 5 seconds timeout is not configurable. I agree that it should be enough in most cases but if someone would want to use NFS for Artemis journal and he doesn't care about performance, we should able him to tune this value. Additionally the timeout doesn't reflect size of journal files.

      Second thing is that when openedFiles.poll() returns null we can't be sure whether it is problem of exhausted disc or exhausted CPU. I think there should be added some kind of latch which would wait until pushOpenRunnable is executed. It will make sure that there is issue with IO operations and it was not caused by exhausted CPU.

      [1]

      09:45:07,418 WARN  [org.apache.activemq.artemis.core.server] (Thread-10 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$4@2646099c-962838060)) AMQ222010: Critical IO Error, shutting down the server. file=NULL, message=unable to open : ActiveMQIOErrorException[errorType=IO_ERROR message=AMQ149003: File not opened]
              at org.apache.activemq.artemis.core.journal.impl.JournalFilesRepository.openFile(JournalFilesRepository.java:423) [artemis-journal-1.5.3.002-redhat-1.jar:1.5.3.002-redhat-1]
              at org.apache.activemq.artemis.core.journal.impl.JournalImpl.moveNextFile(JournalImpl.java:2885) [artemis-journal-1.5.3.002-redhat-1.jar:1.5.3.002-redhat-1]
              at org.apache.activemq.artemis.core.journal.impl.JournalImpl.switchFileIfNecessary(JournalImpl.java:2842) [artemis-journal-1.5.3.002-redhat-1.jar:1.5.3.002-redhat-1]
              at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendRecord(JournalImpl.java:2568) [artemis-journal-1.5.3.002-redhat-1.jar:1.5.3.002-redhat-1]
              at org.apache.activemq.artemis.core.journal.impl.JournalImpl.access$200(JournalImpl.java:87) [artemis-journal-1.5.3.002-redhat-1.jar:1.5.3.002-redhat-1]
              at org.apache.activemq.artemis.core.journal.impl.JournalImpl$4.run(JournalImpl.java:889) [artemis-journal-1.5.3.002-redhat-1.jar:1.5.3.002-redhat-1]
              at org.apache.activemq.artemis.utils.OrderedExecutorFactory$OrderedExecutor$ExecutorTask.run(OrderedExecutorFactory.java:101) [artemis-commons-1.5.3.002-redhat-1.jar:1.5.3.002-redhat-1]
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [rt.jar:1.8.0_111]
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [rt.jar:1.8.0_111]
              at java.lang.Thread.run(Thread.java:745) [rt.jar:1.8.0_111]
      

      [2]

      public JournalFile openFile() throws InterruptedException, ActiveMQIOErrorException {
            if (logger.isTraceEnabled()) {
               logger.trace("enqueueOpenFile with openedFiles.size=" + openedFiles.size());
            }
      
            if (openFilesExecutor == null) {
               pushOpenRunnable.run();
            } else {
               openFilesExecutor.execute(pushOpenRunnable);
            }
      
            JournalFile nextFile = openedFiles.poll(5, TimeUnit.SECONDS);
            if (nextFile == null) {
               fileFactory.onIOError(ActiveMQJournalBundle.BUNDLE.fileNotOpened(), "unable to open ", null);
               // We need to reconnect the current file with the timed buffer as we were not able to roll the file forward
               // If you don't do this you will get a NPE in TimedBuffer::checkSize where it uses the bufferobserver
               fileFactory.activateBuffer(journal.getCurrentFile().getFile());
               throw ActiveMQJournalBundle.BUNDLE.fileNotOpened();
            }
      
            if (logger.isTraceEnabled()) {
               logger.trace("Returning file " + nextFile);
            }
      
            return nextFile;
         }
      

            [WFLY-11404] Artemis throws Critical IO Error if new journal file is not created in 5 seconds

            Bulk closing issues resolved as done with Fix Version 17.0.0.Final.

            Brian Stansberry added a comment - Bulk closing issues resolved as done with Fix Version 17.0.0.Final.

            Tomas Hofman added a comment - Analysis: https://github.com/wildfly/wildfly-proposals/pull/161

            Kabir Khan added a comment -

            mnovak1@redhat.com rsvoboda@redhat.com Will this need extra testing from your side? If the answer is yes, this probably needs to be an EAP7 RFE with an analysis document (thofman I can point you to what is needed from your side if this turns out to be the case)
            CC jperkins-rhn

            Kabir Khan added a comment - mnovak1@redhat.com rsvoboda@redhat.com Will this need extra testing from your side? If the answer is yes, this probably needs to be an EAP7 RFE with an analysis document ( thofman I can point you to what is needed from your side if this turns out to be the case) CC jperkins-rhn

              thofman Tomas Hofman
              thofman Tomas Hofman
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

                Created:
                Updated:
                Resolved: