Type: Bug
Resolution: Done
Priority: Major
In a customer case with many thousands of addresses and many thousands of connections/links, the Agent's CPU usage sits continually close to its CPU limit (3000). This occasionally causes the liveness probe to fail, which sometimes leads to a Kubernetes restart of the container. This is causing service alerts and additional latency in bringing new addresses to the ready state.
The population of addresses is mostly stable, with only occasional creates and deletes; this churn does not seem to correlate with the CPU usage.
Stats collection looks suspicious. Broker and router stats collections are driven from a JavaScript interval (10000ms, untunable). There is no serialisation preventing the next stats run from starting before the last one has finished. In this customer's case, with 3 routers and 4 brokers producing large result sets, it is easy to imagine either the broker stats work or the router stats work taking more than 10 seconds, so runs pile up on top of each other. One way to serialise the runs is sketched below.
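As an illustration only (not the agent's actual code), a minimal sketch of one way to serialise the collection: re-arm a setTimeout after each run completes, rather than firing from a fixed setInterval. The names collect_broker_stats and STATS_INTERVAL_MS are hypothetical.

var STATS_INTERVAL_MS = 10000;

// Hypothetical stand-in for the real broker stats collection; assumed to
// return a Promise that resolves only when the run has completely finished.
function collect_broker_stats() {
    return Promise.resolve();
}

function schedule_serialised(collect, interval) {
    function run() {
        collect().catch(function (e) {
            console.error('stats collection failed: %s', e);
        }).then(function () {
            // Re-arm only after the current run has finished (success or
            // failure), so two runs can never overlap even when a single
            // collection takes longer than the interval.
            setTimeout(run, interval);
        });
    }
    setTimeout(run, interval);
}

schedule_serialised(collect_broker_stats, STATS_INTERVAL_MS);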
Some of the processing of the broker/router stats result sets is done in nested for loops (for all addresses, for all connections, etc.). This coding pattern may contribute to blocking the event loop; a batched alternative is sketched below.
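Purely as a sketch of the batching idea (process_in_batches and its parameters are hypothetical names, not agent code), the loop could be split into slices that yield to the event loop via setImmediate between slices, so timers and the liveness probe handler get a chance to run:

function process_in_batches(items, handle_item, batch_size) {
    return new Promise(function (resolve) {
        var i = 0;
        function next_batch() {
            var end = Math.min(i + batch_size, items.length);
            for (; i < end; i++) {
                handle_item(items[i]);
            }
            if (i < items.length) {
                // Yield to the event loop before processing the next slice.
                setImmediate(next_batch);
            } else {
                resolve();
            }
        }
        next_batch();
    });
}

// e.g. process_in_batches(connections, update_connection_stats, 500);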
I also notice several unguarded log.debug lines whose arguments are computationally expensive and create garbage even when debug logging is turned off:
log.debug('syncing broker %s with %j', broker.id, allocated.map(get_address));
log.debug('[%s] checking addresses, desired=%j, actual=%j => delete %j and create %j', self.id, values(self.addresses).map(address_and_type), values(actual),
          stale.map(address_and_type), missing.map(address_and_type));
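One way to avoid the cost, sketched here under the assumption that the agent's log module exposes some level check (the isDebugEnabled name is hypothetical; substitute whatever check the logger actually provides), is to guard the expensive argument construction:

if (log.isDebugEnabled && log.isDebugEnabled()) {
    // The .map() calls and their intermediate arrays are only created when
    // debug output will actually be emitted.
    log.debug('syncing broker %s with %j', broker.id, allocated.map(get_address));
}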
relates to:
ENTMQMAAS-2641 [#5223] Address/connection stats collections may erroneously run concurrently leading to excessive memory use/OOMs (Closed)
ENTMQMAAS-2668 [#5238] Connections slow to appear or update in the Console (Closed)