Type: Bug
Resolution: Done
Priority: Major
In a customer case with many thousands of addresses and many thousands of connections/links, the Agent's CPU usage sits continually close to its CPU limit (3000). This occasionally causes the liveness probe to fail, which sometimes leads to a Kubernetes restart of the container. This is causing service alerts and additional latency in bringing new addresses to the ready state.
The population of addresses is mostly stable, with only occasional creates and deletes; this churn does not seem to correlate with the CPU usage.
Stats collection looks suspicious. Broker and router stats collections are driven from a JavaScript interval (10000ms, untunable). There is no serialisation preventing the next stats run from starting before the last one has finished. In this customer's case, with 3 routers and 4 brokers producing large result sets, it is easy to imagine either the broker stats work or the router stats work taking more than 10 seconds, so runs pile up on top of each other. One way to serialise the runs is sketched below.
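As an illustration only (not the agent's actual code), a minimal sketch of one way to serialise the collection: re-arm a setTimeout after each run completes, rather than firing from a fixed setInterval. The names collect_broker_stats and STATS_INTERVAL_MS are hypothetical.

var STATS_INTERVAL_MS = 10000;

// Hypothetical stand-in for the real broker stats collection; assumed to
// return a Promise that resolves only when the run has completely finished.
function collect_broker_stats() {
    return Promise.resolve();
}

function schedule_serialised(collect, interval) {
    function run() {
        collect().catch(function (e) {
            console.error('stats collection failed: %s', e);
        }).then(function () {
            // Re-arm only after the current run has finished (success or
            // failure), so two runs can never overlap even when a single
            // collection takes longer than the interval.
            setTimeout(run, interval);
        });
    }
    setTimeout(run, interval);
}

schedule_serialised(collect_broker_stats, STATS_INTERVAL_MS);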
Some of the processing of the broker/router stats result sets is done in nested for loops (for all addresses, for all connections, etc.). This coding pattern may contribute to blocking the event loop; a batched alternative is sketched below.
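Purely as a sketch of the batching idea (process_in_batches and its parameters are hypothetical names, not agent code), the loop could be split into slices that yield to the event loop via setImmediate between slices, so timers and the liveness probe handler get a chance to run:

function process_in_batches(items, handle_item, batch_size) {
    return new Promise(function (resolve) {
        var i = 0;
        function next_batch() {
            var end = Math.min(i + batch_size, items.length);
            for (; i < end; i++) {
                handle_item(items[i]);
            }
            if (i < items.length) {
                // Yield to the event loop before processing the next slice.
                setImmediate(next_batch);
            } else {
                resolve();
            }
        }
        next_batch();
    });
}

// e.g. process_in_batches(connections, update_connection_stats, 500);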
I also notice several unguarded log.debug lines whose arguments are computationally expensive and create garbage even when debug logging is turned off:
log.debug('syncing broker %s with %j', broker.id, allocated.map(get_address));
log.debug('[%s] checking addresses, desired=%j, actual=%j => delete %j and create %j', self.id, values(self.addresses).map(address_and_type), values(actual),
          stale.map(address_and_type), missing.map(address_and_type));
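One way to avoid the cost, sketched here under the assumption that the agent's log module exposes some level check (the isDebugEnabled name is hypothetical; substitute whatever check the logger actually provides), is to guard the expensive argument construction:

if (log.isDebugEnabled && log.isDebugEnabled()) {
    // The .map() calls and their intermediate arrays are only created when
    // debug output will actually be emitted.
    log.debug('syncing broker %s with %j', broker.id, allocated.map(get_address));
}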
relates to:
ENTMQMAAS-2641 [#5223] Address/connection stats collections may erroneously run concurrently leading to excessive memory use/OOMs (Closed)
ENTMQMAAS-2668 [#5238] Connections slow to appear or update in the Console (Closed)