Subscription Watch / SWATCH-4570

Investigate the swatch-utilization memory leak

    • Team: subs-swatch-lightning
    • Sprints: Swatch Lightning Sprint 9, Swatch Lightning Sprint 10

      The swatch-utilization service is experiencing a steady memory increase that eventually leads to OOM kills and pod restarts.

      We need to investigate the root cause of the memory leak in swatch-utilization.

      Note the current behaviour around the Splunk HEC (HTTP Event Collector) integration:

      1. Splunk HEC is rejecting all log batches with {"message":"could not find parser"}, which indicates a source type misconfiguration on the Splunk server side. In the analyzed production logs, we found 4,155 occurrences of this error in a single pod's log file (86K lines). To be investigated in SWATCH-4562.

      2. The underlying splunk-library-javalogging library (v1.11.8) has a known bug where failed batches cause a NullPointerException during error parsing (Cannot invoke "com.google.gson.JsonElement.getAsLong()" because the return value of "com.google.gson.JsonObject.get(String)" is null). This is tracked in splunk-library-javalogging#190.

      3. The log volume is extremely high because the UtilizationSummaryMeasurementValidator generates WARN-level logs for every unsupported metricId, and each log line includes the full UtilizationSummary payload (~1,300 characters per line). In the analyzed log file, 29,319 out of 86,329 lines (~34%) were these unsupported metricId warnings.

      4. The Splunk handler has no memory cap. As stated in the quarkus-logging-splunk documentation: "The number of events kept in memory for batching purposes is not limited." Combined with SPLUNK_HEC_RETRY_COUNT=3 (each failed batch retried 3 times over ~7 seconds), log events accumulate in memory faster than they are discarded. Already reported by https://github.com/quarkiverse/quarkus-logging-splunk/issues/255.
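      The payload reduction described in point 3 could look roughly like the following sketch: a compact message carrying only the identifying fields instead of the full UtilizationSummary. The class and method names here are hypothetical, not taken from the swatch-utilization codebase.

```java
// Sketch of a compact log message for unsupported metricIds.
// Names (UnsupportedMetricLog, compactMessage) are illustrative only.
public class UnsupportedMetricLog {

    // Build a short, fixed-shape message with only the identifying fields,
    // instead of serializing the ~1,300-character UtilizationSummary
    // payload into every log line.
    public static String compactMessage(String metricId, String productId, String orgId) {
        return String.format("Unsupported metricId=%s productId=%s orgId=%s",
                metricId, productId, orgId);
    }

    public static void main(String[] args) {
        // Emitted at DEBUG rather than WARN, a line like this is filtered
        // out under the default log level and costs nothing in production.
        System.out.println(compactMessage("Cores", "rosa", "org_123"));
    }
}
```

      Each such line is well under 100 characters, versus ~1,300 characters today, which is where the bulk of the log-volume reduction comes from.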

      Acceptance Criteria

      • Reduce log volume: Downgrade the "unsupported metricId" and "invalid metricId" warnings from WARN to DEBUG, and stop logging the full payload (only log metricId, productId, and orgId). This alone eliminates ~83% of the log volume.
      • Disable retries: Set SPLUNK_HEC_RETRY_COUNT=0 so failed batches are discarded immediately instead of being retained in memory during retries.
      • Enable async handler with bounded queue: Set quarkus.log.handler.splunk.async.enabled=true, queue-length=512, overflow=discard. This puts a hard cap on the number of log events held in memory. When the queue is full, new events are dropped rather than accumulating indefinitely.
      • Reproduce the memory leak to verify the before/after behaviour (no component/iqe test is needed for this).
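      Taken together, the retry and async-handler criteria above would translate into configuration roughly like the following sketch. SPLUNK_HEC_RETRY_COUNT is the deployment environment variable named in this ticket; the quarkus.log.handler.splunk.async.* keys are the values proposed above.

```properties
# Discard failed batches immediately rather than holding them in memory
# across ~7 seconds of retries.
SPLUNK_HEC_RETRY_COUNT=0

# Bound the number of in-memory log events: once 512 events are queued,
# new events are dropped instead of accumulating without limit.
quarkus.log.handler.splunk.async.enabled=true
quarkus.log.handler.splunk.async.queue-length=512
quarkus.log.handler.splunk.async.overflow=discard
```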

              Assignee: Jose Carvajal Hilario (jcarvaja@redhat.com)
              Reporter: Jose Carvajal Hilario (jcarvaja@redhat.com)