Uploaded image for project: 'JBoss A-MQ'
  1. JBoss A-MQ
  2. ENTMQ-1823

Two clustered A-MQ nodes using JDBCPersistentAdaptor and lease database locker were active the same time

    XMLWordPrintable

Details

    • Bug
    • Resolution: Done
    • Critical
    • JBoss A-MQ 6.3
    • 6.1.1
    • broker
    • None
    • ER2
    • Hide

      Steps to reproduce:

      • lower the cleanup interval to 10 sec
      • lower the pool size to 2
      • send/receive messages from transacted and non-transacted consumers (we actually have xa and local transaction consumers)
      Show
      Steps to reproduce: lower the cleanup interval to 10 sec lower the pool size to 2 send/receive messages from transacted and non-transacted consumers (we actually have xa and local transaction consumers)

    Description

      We were able to reproduce the problem using a small pool size, we could look into the problem more deeply.

      There were two types of blocked threads:
      1) borrow connection
      2) acquire lock

      Looking at the code we were able to identify the #2 lock, which is makes the transactions wait:

      • DefaultJDBCAdapter.cleanupExclusiveLock

      The lock has 3 states:
      1) open - read lock can be acquired freely
      2) half-closed

      • outstanding read locks needs to finish first
      • write lock pending (blocking until read is done)
      • no new read lock handed out (new requests blocking)
        3) closed - until write lock(s) finish new readers blocks

      The cleanupExclusiveLock protects the internal structure of the database from changing in an unexpected way. It is used by the DB cleanup (doDeleteOldMessages) in WRITE mode. All other transactions require READ mode.

      The actual deadlock is caused by this lock, which blocks transactions to finish and others are blocked by the lack of free connection.

      For the exact explanation please check the overview picture attached.

      "Lock" represents the rw lock
      "Conn" is the connection pool.
      "Tx1-3" represent the transactions.

      There are two types of transactions:
      tx1 - "not transactional" - they first acquire the read lock and then getConnection.
      (they need to come first, in "open mode")

      tx3 - "transactional" - they try to acquire a the read lock whilst owning a connection!
      (they must come in "half-closed" mode.)

      In case of sending/receiving in transaction (TX3), the connection is acquired on BEGIN, so there is a connection already in the TransactionContext.
      (eg. "ActiveMQ Transport: tcp:///xx.xx.xx.xx:41767@14016") in jstack.xxxxx.20160531214821.AMQ.log

      In case of sending/receiving using local transaction (TX1), the connection is first acquired inside the lock.
      (eg. "ActiveMQ Transport: tcp:///xx.xx.xx.xx:27378@14016")

      The problem occurs, if TX1 type operations cannot acquire a connection, thus blocking.
      Tx2 (cleanup) is blocked, due to Tx1.
      TX3 keeps getting connections first and wait for the clean-up to finish whilst owning the connection.

      If tx3 quick enough it will make the cycle complete, then this situation will exhaust the database pool, so the server will loose the master lock.

      Regarding the fix, we don't think that getting the connection before the clean-up would help, because in our cases the process was not able to enter into the write locked area.

      You will need to move all getConnection() calls out of the read-locks or only execute the statement within the read-lock.

      Attachments

        Issue Links

          Activity

            People

              gtully@redhat.com Gary Tully
              rhn-support-qluo Joe Luo
              Votes:
              1 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: