[ENTMQCL-2681] [jms] Wait() sometimes blocks forever when closing producers and consumers - Red Hat Issue Tracker

Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 2.10.0.B1
Affects Version/s: 2.9.0.GA
Component/s: qpid-jms
Labels:
None

Blocked:
False
Ready:
False
Forum Reference:
https://issues.apache.org/jira/browse/QPIDJMS-529
Release Note Text:
Undefined
Target Release:

2.10.0.GA
Steps to Reproduce:

Hard to reproduce, as the behavior of lazySet seems to be non-deterministic.

Market:

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

This happens during tests when the client is connected to Interconnect and produces/consumes on a queue that is on an AMQ 7 Broker through a linkRoute. Making the broker crash causes the consumer/producer to fail.

When trying to close the consumer, producer or session, the Qpid JMS client occasionally blocks forever in a wait() call that has no timeout. This happens intermittently and cannot be reproduced reliably because of the concurrent nature of the client.

Looking at the code, the destroyResources() method of the JmsConnection class [1] calls ProgressiveProviderFuture.sync()/BalancedProviderFuture.sync() without a timeout value. Both classes extend the ProviderFuture class.

ProviderFuture.onSuccess() and onFailure() use lazySet [2] to switch the instance's state variable from COMPLETING to either SUCCESS or FAILURE. I suspect that, because of lazySet, the new value of state may only become visible to the thread running sync() at a much later point, so sync() would not leave its while loop, which exits only once isComplete() returns true.

In a highly concurrent situation, state=SUCCESS or FAILURE could actually become visible only after the first two of the following events:

1. The synchronized section in ProviderFuture.onSuccess()/onFailure() [3] executes. If it runs at this point, notifyAll() never fires, as there are no waiting threads yet.
2. sync() reaches the wait().
3. The value of state is finally updated to SUCCESS or FAILURE, but it is too late: sync() is blocked in wait() and the chance for notifyAll() to wake it has already passed.
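The suspected ordering can be replayed deterministically with a small stand-in class. This is only a sketch of the hypothesis above, not the Qpid JMS source: LostWakeupDemo, its waiting counter, and the method names are hypothetical, and the wait is bounded purely so that the demo terminates.

```java
// Illustrative only: NOT the Qpid JMS ProviderFuture implementation.
final class LostWakeupDemo {
    private volatile boolean complete; // stands in for state == SUCCESS/FAILURE
    private int waiting;               // guarded by the monitor on "this"

    // Step 1: the notify section runs; with no registered waiter it does nothing.
    synchronized boolean notifyIfWaiting() {
        if (waiting > 0) {
            notifyAll();
            return true;
        }
        return false;
    }

    // Step 2: the waiter sees "not complete" and blocks.
    // Returns whether completion was visible once the wait ended.
    synchronized boolean awaitOnce(long timeoutMillis) throws InterruptedException {
        if (timeoutMillis <= 0) {
            throw new IllegalArgumentException("timeoutMillis must be positive");
        }
        if (!complete) {
            waiting++;
            try {
                wait(timeoutMillis);
            } finally {
                waiting--;
            }
        }
        return complete;
    }

    // Step 3: the state finally becomes visible, but nobody notifies again.
    void publishState() {
        complete = true;
    }

    // Replays the three steps in the suspected order.
    static boolean[] replay(long timeoutMillis) throws InterruptedException {
        LostWakeupDemo future = new LostWakeupDemo();
        boolean notified = future.notifyIfWaiting();    // step 1
        Thread publisher = new Thread(() -> {
            try {
                Thread.sleep(50);                       // let the waiter block first
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            future.publishState();                      // step 3
        });
        publisher.start();
        boolean seen = future.awaitOnce(timeoutMillis); // step 2
        publisher.join();
        return new boolean[] { notified, seen };
    }
}
```

Calling LostWakeupDemo.replay(500) is expected to report that no notification was sent and that the waiter observed completion only because its bounded wait expired. With an untimed wait() in step 2, the waiter in this simulation would never return.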

I believe there are 3 possible fixes for this:

1. Systematically use wait(SOME_FIXED_TIMEOUT) in the sync() method. This is the easiest fix and probably the most robust, since as a general good practice infinite waits should be avoided. It lets sync() break out of the wait and eventually read state=SUCCESS or FAILURE.
2. In the class JmsConnection, use request.sync(timeout,..) instead of request.sync(). We would get the same behavior, but there are more lines of code to change.
3. Do not use lazySet to change the state, but some more deterministic method (not sure if it would definitely solve the issue though).
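The first two options can be sketched as follows. This is a minimal illustration under stated assumptions, not a patch for ProviderFuture: BoundedWaitFuture, WAIT_SLICE_MILLIS, completeSilently(), and the sync(long, TimeUnit) signature shown here are hypothetical names chosen for the example.

```java
import java.util.concurrent.TimeUnit;

// Illustrative only: NOT the Qpid JMS ProviderFuture implementation.
final class BoundedWaitFuture {
    // Hypothetical fixed timeout for each wait() call (option 1).
    private static final long WAIT_SLICE_MILLIS = 100;

    private volatile boolean complete;

    // Normal completion: publish the state, then wake any waiter.
    void onSuccess() {
        complete = true;
        synchronized (this) {
            notifyAll();
        }
    }

    // Models the suspected failure: the state changes but no waiter is notified.
    void completeSilently() {
        complete = true;
    }

    // Option 1: never wait without a timeout; re-check the state after each slice.
    void sync() throws InterruptedException {
        synchronized (this) {
            while (!complete) {
                wait(WAIT_SLICE_MILLIS);
            }
        }
    }

    // Option 2: the caller supplies an overall deadline; returns false on timeout.
    boolean sync(long amount, TimeUnit unit) throws InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(amount);
        synchronized (this) {
            while (!complete) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false;
                }
                long remainingMillis = Math.max(1, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
                wait(Math.min(WAIT_SLICE_MILLIS, remainingMillis));
            }
            return true;
        }
    }
}
```

In this sketch a missed notification costs at most one wait slice, because the waiter re-reads the state every time wait() returns. The trade-off is a periodic wake-up while the operation is still pending.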

Note that the bug tends to happen more often on a Windows machine. This is perhaps because BalancedProviderFuture has a much lower idle loop count (=100), while ProgressiveProviderFuture, used on Linux, runs up to 101000 idle loops before reaching the wait(). So in highly concurrent situations BalancedProviderFuture will reach wait(), and eventually block, with a higher probability.
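To make the role of the idle loop count concrete, here is a simplified spin-then-wait sketch. It is an assumption-laden illustration rather than the code of either future class: SpinThenWaitFuture, spinLimit, and timesBlocked() are hypothetical, and the wait is bounded so that the example cannot hang.

```java
import java.util.concurrent.TimeUnit;

// Illustrative only: NOT BalancedProviderFuture or ProgressiveProviderFuture.
final class SpinThenWaitFuture {
    private final int spinLimit;      // idle loops allowed before falling back to wait()
    private volatile boolean complete;
    private int timesBlocked;         // guarded by the monitor on "this"

    SpinThenWaitFuture(int spinLimit) {
        this.spinLimit = spinLimit;
    }

    void complete() {
        complete = true;
        synchronized (this) {
            notifyAll();
        }
    }

    synchronized int timesBlocked() {
        return timesBlocked;
    }

    // Spins up to spinLimit times, then blocks; returns true once completion is seen.
    boolean sync(long maxWaitMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMillis);
        int idleCount = 0;
        while (!complete) {
            if (idleCount < spinLimit) {
                idleCount++;          // cheap re-check, no monitor involved
                continue;
            }
            synchronized (this) {
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    return complete;
                }
                if (!complete) {
                    timesBlocked++;   // the path on which a missed wake-up would hurt
                    wait(remainingMillis);
                }
            }
        }
        return true;
    }
}
```

A waiter created with a spin limit of 100 exhausts its idle loops far sooner than one created with 101000, so it enters the blocking branch more often when completion arrives slightly late. That is the branch in which a missed notification matters.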

[1] https://github.com/apache/qpid-jms/blob/cbcd933e056d170c85ccebce307f9b85fffa6c7e/qpid-jms-client/src/main/java/org/apache/qpid/jms/JmsConnection.java#L748

[2] https://github.com/apache/qpid-jms/blob/cbcd933e056d170c85ccebce307f9b85fffa6c7e/qpid-jms-client/src/main/java/org/apache/qpid/jms/provider/ProviderFuture.java#L80

[3] https://github.com/apache/qpid-jms/blob/cbcd933e056d170c85ccebce307f9b85fffa6c7e/qpid-jms-client/src/main/java/org/apache/qpid/jms/provider/ProviderFuture.java#L82

Attachments:
console-output.log, 10.12 MB, 2021/03/25 9:48 AM
console-output-run2.zip, 3.84 MB, 2021/03/26 6:36 AM
console-run3-with additionallogs.zip, 3.88 MB, 2021/03/29 6:18 AM
console-run3-with additional-logs.zip, 3.88 MB, 2021/03/29 6:18 AM
trace-level-log.log, 35.31 MB, 2021/03/25 9:50 AM

is related to

ENTMQIC-2750 Improper handling of link refusal when destination peer for link-routing goes away

Assignee: Robbie Gemmell
Reporter: Alain Pham (Inactive)
Tester: Jiri Daněk
Votes: 0
Watchers: 5
Created: 2021/03/24 3:03 PM
Updated: 2022/09/09 7:14 AM
Resolved: 2021/08/24 10:22 AM
