AMQ Clients / ENTMQCL-2681

[jms] Wait() sometimes blocks forever when closing producers and consumers


    • Type: Bug
    • Priority: Major
    • Resolution: Done
    • Fix Version: 2.10.0.B1
    • Affects Version: 2.9.0.GA
    • Component: qpid-jms
      Hard to reproduce, as the behavior of lazySet seems to be non-deterministic.

      This happens during tests in which the client is connected to Interconnect and produces/consumes on a queue that lives on an AMQ 7 broker behind a linkRoute. Crashing the broker causes the consumer/producer to fail.

      When trying to close the consumer, producer, or session, the Qpid JMS client occasionally ends up blocked forever in a wait() that has no timeout. This happens only from time to time and is not systematically reproducible, due to the concurrent nature of the client.

      Looking at the code, we see that in the class JmsConnection, method destroyResources() [1], there is a call to ProgressiveProviderFuture.sync()/BalancedProviderFuture.sync() without timeout values. These two classes extend ProviderFuture.
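      For illustration, the call shape looks roughly like this (a hedged sketch, not the actual qpid-jms source; DestroyResourcesSketch and awaitRequest are made-up names, and "request" stands in for the ProviderFuture the provider hands back):

      import java.util.concurrent.TimeUnit;
      import org.apache.qpid.jms.provider.ProviderFuture;

      // Hedged sketch of the call shape in JmsConnection.destroyResources() [1].
      class DestroyResourcesSketch {
          void awaitRequest(ProviderFuture request) throws Exception {
              request.sync();                        // as shipped: no timeout, can block forever
              // request.sync(30, TimeUnit.SECONDS); // bounded overload (see fix option 2 below)
          }
      }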

      ProviderFuture.onSuccess() and onFailure() use lazySet [2] to switch the instance's state variable from COMPLETING to either SUCCESS or FAILURE. I suspect that because lazySet only guarantees an eventually-visible (ordered but possibly delayed) write, the new value of state may only become readable by the sync() method on the other thread at a much later point, so sync() never gets out of its while loop with the condition isComplete() == true.

      In a highly concurrent situation, it could be that state = SUCCESS or FAILURE is effectively written only after the following sequence of events (a sketch follows this list):

      • The synchronized section in ProviderFuture.onSuccess()/onFailure() [3] executes first. Since no thread is waiting yet, its notifyAll() wakes nobody.
      • sync() then reaches its wait().
      • Finally the value of state is really updated to SUCCESS or FAILURE, but it is too late: sync() is blocked in wait(), and the notifyAll() has already misfired.
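      A condensed sketch of that pattern (class and field names are simplified illustrations, not the exact ProviderFuture source; the numbered comments match the steps above):

      import java.util.concurrent.atomic.AtomicIntegerFieldUpdater;

      // Condensed, hedged sketch of the ProviderFuture state machine [2][3].
      class FutureSketch {
          private static final AtomicIntegerFieldUpdater<FutureSketch> STATE =
              AtomicIntegerFieldUpdater.newUpdater(FutureSketch.class, "state");
          static final int COMPLETING = 1, SUCCESS = 2;

          private volatile int state = COMPLETING;

          boolean isComplete() {
              return state > COMPLETING;
          }

          void onSuccess() {
              STATE.lazySet(this, SUCCESS);  // (3) ordered store whose visibility to
                                             //     other threads may be delayed
              synchronized (this) {
                  notifyAll();               // (1) in the bad interleaving this runs
              }                              //     first, while no thread is waiting
          }

          void sync() throws InterruptedException {
              while (!isComplete()) {        // still sees the stale COMPLETING value
                  synchronized (this) {
                      if (!isComplete()) {
                          wait();            // (2) parks with no timeout; once (3)
                      }                      //     lands, nobody will notify again
                  }
              }
          }
      }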


      I believe there are three possible fixes for this:

      • Systematically use wait(SOME_FIXED_TIMEOUT) in the sync() method. This is the easiest fix and probably the most robust, since as a general good practice infinite waits should be avoided. It lets sync() break out of the wait and eventually read state = SUCCESS or FAILURE (a sketch follows this list).
      • In the class JmsConnection, call request.sync(timeout, ..) instead of request.sync(). This yields the same behavior, but there are more lines of code to change.
      • Do not use lazySet to change state; use a write with stronger visibility guarantees instead (though I am not sure this alone would definitely solve the issue).
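      As a sketch of the first option, sync() in the FutureSketch above could be bounded like this (SOME_FIXED_TIMEOUT is an arbitrary illustrative value, not one taken from qpid-jms):

      // Hedged sketch of fix option 1: never park unbounded. Re-checking
      // isComplete() after every wakeup means a missed notifyAll() costs at
      // most one timeout period instead of hanging forever.
      static final long SOME_FIXED_TIMEOUT = 250; // milliseconds, arbitrary

      void sync() throws InterruptedException {
          synchronized (this) {
              while (!isComplete()) {
                  wait(SOME_FIXED_TIMEOUT); // wakes even if the notify was missed
              }
          }
      }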

      Note that the bug tends to happen more often on a Windows machine. This is perhaps because BalancedProviderFuture has a much lower idle-loop count (=100), while the ProgressiveProviderFuture used on Linux performs up to 101000 idle loops before reaching the wait(). In highly concurrent situations, BalancedProviderFuture will therefore reach wait(), and eventually block, with a higher probability.
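      For context, both implementations share a spin-then-wait shape along these lines (a hedged sketch reusing isComplete() from the FutureSketch above; the loop counts are the ones quoted in this report, not verified against the source):

      // MAX_SPINS stands in for the per-class idle-loop budget: ~100 for
      // BalancedProviderFuture, up to ~101000 for ProgressiveProviderFuture.
      static final int MAX_SPINS = 100;

      void syncWithSpin() throws InterruptedException {
          int spins = 0;
          while (!isComplete()) {
              if (spins++ < MAX_SPINS) {
                  Thread.onSpinWait();      // stay runnable and re-read state soon
              } else {
                  synchronized (this) {
                      if (!isComplete()) {
                          wait();           // the unbounded wait this report is about
                      }
                  }
              }
          }
      }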

       

      [1] https://github.com/apache/qpid-jms/blob/cbcd933e056d170c85ccebce307f9b85fffa6c7e/qpid-jms-client/src/main/java/org/apache/qpid/jms/JmsConnection.java#L748 

      [2] https://github.com/apache/qpid-jms/blob/cbcd933e056d170c85ccebce307f9b85fffa6c7e/qpid-jms-client/src/main/java/org/apache/qpid/jms/provider/ProviderFuture.java#L80

      [3] https://github.com/apache/qpid-jms/blob/cbcd933e056d170c85ccebce307f9b85fffa6c7e/qpid-jms-client/src/main/java/org/apache/qpid/jms/provider/ProviderFuture.java#L82

        Attachments:
          1. console-output.log (10.12 MB, Alain Pham)
          2. trace-level-log.log (35.31 MB, Alain Pham)

              Assignee: gemmellr Robbie Gemmell
              Reporter: alainpham Alain Pham (Inactive)
              Jiri Daněk
              Votes: 0
              Watchers: 5
