- Bug
- Resolution: Done
- Critical
- None
- None
- 3
- False
-
- False
- subs-swatch-lightning
-
-
- Swatch Lightning Sprint 9, Swatch Lightning Sprint 10
The swatch-utilization service is experiencing a steady memory increase that eventually leads to OOM kills and pod restarts.
We need to investigate the root cause of the memory leak in swatch-utilization.
Note the current behaviour around the Splunk HEC event collector:
1. Splunk HEC is rejecting all log batches with {"message":"could not find parser"}, which indicates a source type misconfiguration on the Splunk server side. In the analyzed production logs, we found 4,155 occurrences of this error in a single pod's log file (86K lines). To be investigated by SWATCH-4562.
2. The underlying splunk-library-javalogging library (v1.11.8) has a known bug where failed batches cause a NullPointerException during error parsing (Cannot invoke "com.google.gson.JsonElement.getAsLong()" because the return value of "com.google.gson.JsonObject.get(String)" is null). This is tracked in splunk-library-javalogging#190.
3. The log volume is extremely high because the UtilizationSummaryMeasurementValidator generates WARN-level logs for every unsupported metricId, and each log line includes the full UtilizationSummary payload (~1,300 characters per line). In the analyzed log file, 29,319 out of 86,329 lines (~34%) were these unsupported metricId warnings.
4. The Splunk handler has no memory cap. As stated in the quarkus-logging-splunk documentation: "The number of events kept in memory for batching purposes is not limited." Combined with SPLUNK_HEC_RETRY_COUNT=3 (each failed batch retried 3 times over ~7 seconds), log events accumulate in memory faster than they are discarded. Already reported by https://github.com/quarkiverse/quarkus-logging-splunk/issues/255.
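Taken together, the four findings describe one failure mode: unbounded buffering plus retries. The handler configuration they imply would look roughly like the sketch below (property names are from the quarkus-logging-splunk extension; the mapping of the SPLUNK_HEC_RETRY_COUNT env var to max-retries is an assumption, and the values are illustrative, not the actual deployment config):

```properties
# Assumed current state (illustrative).
quarkus.log.handler.splunk.enabled=true
# SPLUNK_HEC_RETRY_COUNT=3 presumably maps here: each failed batch is
# retried 3 times over ~7 seconds while its events stay in memory.
quarkus.log.handler.splunk.max-retries=3
# No async.* settings are configured, so per the extension docs
# "the number of events kept in memory for batching purposes is not limited".
```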
Acceptance Criteria
- Reduce log volume: Downgrade the "unsupported metricId" and "invalid metricId" warnings from WARN to DEBUG, and stop logging the full payload (log only metricId, productId, and orgId). Because these ~1,300-character lines account for most of the bytes written, this alone eliminates ~83% of the log volume by size.
- Disable retries: Set SPLUNK_HEC_RETRY_COUNT=0 so failed batches are discarded immediately instead of being retained in memory during retries.
- Enable async handler with bounded queue: Set quarkus.log.handler.splunk.async.enabled=true, queue-length=512, overflow=discard. This puts a hard cap on the number of log events held in memory. When the queue is full, new events are dropped rather than accumulating indefinitely.
- Reproduce the memory leak to verify the before/after behaviour (no component/IQE test is needed for this).
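The first criterion can be sketched as follows. This is a hypothetical helper, not the actual UtilizationSummaryMeasurementValidator code: the class, method, and message format are illustrative, and java.util.logging stands in for whatever logging facade the service actually uses. The point is the DEBUG (FINE) level plus a short message carrying only metricId, productId, and orgId instead of the full ~1,300-character UtilizationSummary payload.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of the validator change (names are illustrative).
public class UnsupportedMetricLogger {
    private static final Logger LOG =
            Logger.getLogger(UnsupportedMetricLogger.class.getName());

    // Builds the trimmed message: only the three identifying fields,
    // never the full UtilizationSummary payload.
    static String unsupportedMetricMessage(String metricId, String productId, String orgId) {
        return String.format("Unsupported metricId=%s productId=%s orgId=%s",
                metricId, productId, orgId);
    }

    static void logUnsupportedMetric(String metricId, String productId, String orgId) {
        // DEBUG-level (FINE in JUL): guarded so production, which typically
        // runs at INFO and above, never formats or emits these lines at all.
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(unsupportedMetricMessage(metricId, productId, orgId));
        }
    }
}
```

With the level gate in place, the per-event cost at INFO is a single isLoggable check, which is what makes the volume reduction effective regardless of the Splunk handler's behaviour downstream.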