JBoss A-MQ / ENTMQ-1168

Broker cannot be restarted in shared fs master slave after network failure

      It is reproducible even when only one instance is deployed:

      1) start A-MQ with persistent storage on shared NFSv4
      2) block connection between A-MQ and NFS server (e.g. using iptables – see below)
      3) wait until A-MQ reports failure in the broker shutdown procedure in the log
      4) reestablish connection between A-MQ and NFS server
      5) manually restart broker bundle
      6) the broker will not start because it claims that the database is locked

      iptables commands executed on the machine where A-MQ is running:

      sudo iptables -A INPUT -s $NFS_SERVER_IP -j DROP
      sudo iptables -A OUTPUT -d $NFS_SERVER_IP -j DROP
      

      I have two instances (A and B) deployed in a shared filesystem master/slave configuration (A is master, B is slave). When I simulate a network failure between the master and the NFS server, B becomes master and A starts its shutdown procedure. A's shutdown procedure throws exceptions related to I/O errors (see attached log file) because the kahaDB folder on the shared NFS is unreachable, and A does shut down.

      However, when I stop broker B (which is currently master), reestablish the connection between A and the NFS server, and manually restart the broker on A, it claims that the DB is locked and the broker on A never starts. This is especially bad when the broker is restarted automatically by setting restartAllowed="true" in activemq.xml. The only way to successfully start the broker on A is to stop Fuse entirely and start it again.

      With debug logging enabled for the SharedFileLocker class, it reports:

      08:42:19,559 | DEBUG | AMQ-2-thread-1   | SharedFileLocker                 | 140 - org.apache.activemq.activemq-osgi - 5.11.0.redhat-621032 | Database /mnt/nfs/fuse-shared/standaloneFaframTest/lock is locked... waiting 10 seconds for the database to be unlocked. Reason: java.io.IOException: File '/mnt/nfs/fuse-shared/standaloneFaframTest/lock' could not be locked as lock is already held for this jvm.
      

      So it looks as if A still holds the lock, but that should not be possible, since B was master in the meantime.
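      One possible reading of "already held for this jvm" is that the refusal comes from lock state tracked inside the JVM rather than from the NFS server, which would also fit the observation that only a full restart of the process helps. The sketch below is not ActiveMQ code and not a confirmed diagnosis. It only illustrates, with plain java.nio, that a JVM rejects a second lock on a file it already holds and accepts one again after an explicit release. The class and method names (JvmLockSketch, secondLockRejected, lockableAfterRelease) are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical illustration only; this is not how SharedFileLocker is implemented.
class JvmLockSketch {

    private static FileChannel open(Path lockFile) throws IOException {
        return FileChannel.open(lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
    }

    // True if a second lock attempt from the same JVM is rejected
    // while the first lock on the file is still held.
    static boolean secondLockRejected(Path lockFile) throws IOException {
        try (FileChannel first = open(lockFile); FileChannel second = open(lockFile)) {
            FileLock held = first.tryLock();
            try {
                second.tryLock();
                return false; // the second attempt was not rejected
            } catch (OverlappingFileLockException e) {
                return true;  // the JVM already holds a lock on this file
            } finally {
                if (held != null) {
                    held.release();
                }
            }
        }
    }

    // True if the file can be locked again once the first lock was released.
    static boolean lockableAfterRelease(Path lockFile) throws IOException {
        try (FileChannel first = open(lockFile)) {
            FileLock held = first.tryLock();
            if (held != null) {
                held.release();
            }
        }
        try (FileChannel second = open(lockFile); FileLock again = second.tryLock()) {
            return again != null;
        }
    }

    public static void main(String[] args) throws IOException {
        Path lockFile = Files.createTempFile("lock-sketch", ".lock");
        System.out.println("second lock rejected while held: " + secondLockRejected(lockFile));
        System.out.println("lockable after release: " + lockableAfterRelease(lockFile));
        Files.deleteIfExists(lockFile);
    }
}
```

      If the shutdown on A fails with I/O errors before an equivalent release step runs, in-process state of this kind could outlive the broker bundle restart. That is an assumption to be checked against the attached log, not something this report establishes.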

      Persistent storage configuration:

              <persistenceAdapter>
                      <kahaDB directory="/mnt/nfs/fuse-shared/standaloneFaframTest/" lockKeepAlivePeriod="2000">
                              <locker>
                                      <shared-file-locker lockAcquireSleepInterval="10000" />
                              </locker>
                      </kahaDB>
              </persistenceAdapter>
      

              gtully@redhat.com Gary Tully
              knetl.j@gmail.com Jakub Knetl (Inactive)