Subscription Watch / SWATCH-4570

Investigate the swatch-utilization memory leak

    • Team: subs-swatch-lightning
    • Sprints: Swatch Lightning Sprint 9, Swatch Lightning Sprint 10

      The swatch-utilization service is experiencing a steady memory increase that eventually leads to OOM kills and pod restarts.

      We need to investigate the root cause of the memory leak in swatch-utilization.

      Note the current behaviour around the Splunk HEC (HTTP Event Collector) integration:

      1. Splunk HEC is rejecting all log batches with {"message":"could not find parser"}, which indicates a source type misconfiguration on the Splunk server side. In the analyzed production logs, we found 4,155 occurrences of this error in a single pod's log file (86K lines). To be investigated in SWATCH-4562.

      2. The underlying splunk-library-javalogging library (v1.11.8) has a known bug where failed batches cause a NullPointerException during error parsing (Cannot invoke "com.google.gson.JsonElement.getAsLong()" because the return value of "com.google.gson.JsonObject.get(String)" is null). This is tracked in splunk-library-javalogging#190.

      3. The log volume is extremely high because the UtilizationSummaryMeasurementValidator generates WARN-level logs for every unsupported metricId, and each log line includes the full UtilizationSummary payload (~1,300 characters per line). In the analyzed log file, 29,319 out of 86,329 lines (~34%) were these unsupported metricId warnings.

      4. The Splunk handler has no memory cap. As stated in the quarkus-logging-splunk documentation: "The number of events kept in memory for batching purposes is not limited." Combined with SPLUNK_HEC_RETRY_COUNT=3 (each failed batch retried 3 times over ~7 seconds), log events accumulate in memory faster than they are discarded. Already reported by https://github.com/quarkiverse/quarkus-logging-splunk/issues/255.
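      The payload reduction described in point 3 could look roughly like the following sketch: a compact message carrying only the identifying fields instead of the full UtilizationSummary. The class and method names here are hypothetical, not taken from the swatch-utilization codebase.

```java
// Sketch of a compact log message for unsupported metricIds.
// Names (UnsupportedMetricLog, compactMessage) are illustrative only.
public class UnsupportedMetricLog {

    // Build a short, fixed-shape message with only the identifying fields,
    // instead of serializing the ~1,300-character UtilizationSummary
    // payload into every log line.
    public static String compactMessage(String metricId, String productId, String orgId) {
        return String.format("Unsupported metricId=%s productId=%s orgId=%s",
                metricId, productId, orgId);
    }

    public static void main(String[] args) {
        // Emitted at DEBUG rather than WARN, a line like this is filtered
        // out under the default log level and costs nothing in production.
        System.out.println(compactMessage("Cores", "rosa", "org_123"));
    }
}
```

      Each such line is well under 100 characters, versus ~1,300 characters today, which is where the bulk of the log-volume reduction comes from.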

      Acceptance Criteria

      • Reduce log volume: Downgrade the "unsupported metricId" and "invalid metricId" warnings from WARN to DEBUG, and stop logging the full payload (only log metricId, productId, and orgId). This alone eliminates ~83% of the log volume.
      • Disable retries: Set SPLUNK_HEC_RETRY_COUNT=0 so failed batches are discarded immediately instead of being retained in memory during retries.
      • Enable async handler with bounded queue: Set quarkus.log.handler.splunk.async.enabled=true, queue-length=512, overflow=discard. This puts a hard cap on the number of log events held in memory. When the queue is full, new events are dropped rather than accumulating indefinitely.
      • Reproduce the memory leak to verify the before/after behaviour (no component/iqe test is needed for this).
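      Taken together, the retry and async-handler criteria above would translate into configuration roughly like the following sketch. SPLUNK_HEC_RETRY_COUNT is the deployment environment variable named in this ticket; the quarkus.log.handler.splunk.async.* keys are the values proposed above.

```properties
# Discard failed batches immediately rather than holding them in memory
# across ~7 seconds of retries.
SPLUNK_HEC_RETRY_COUNT=0

# Bound the number of in-memory log events: once 512 events are queued,
# new events are dropped instead of accumulating without limit.
quarkus.log.handler.splunk.async.enabled=true
quarkus.log.handler.splunk.async.queue-length=512
quarkus.log.handler.splunk.async.overflow=discard
```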

              Assignee: Jose Carvajal Hilario (jcarvaja@redhat.com)
              Reporter: Jose Carvajal Hilario (jcarvaja@redhat.com)