  1. Red Hat Advanced Cluster Management
  2. ACM-27030

As a Global Hub admin, I want to configure Kafka message retention policy to prevent unlimited data accumulation



      User Story

      As a Global Hub administrator, I want to configure explicit Kafka message retention policies so that old events are automatically cleaned up, preventing unlimited storage growth and associated memory issues.

      Problem

      Currently, the built-in Kafka topics are configured with cleanup.policy: compact only, with no explicit retention settings. As a result:

      • Events are kept indefinitely, with no automatic time-based or size-based deletion
      • Compaction removes only superseded records for the same key; the latest record for each key is retained regardless of its age
      • Events accumulate in Kafka without bound, contributing to memory growth in long-running clusters
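The difference between the two cleanup policies can be illustrated with a small in-memory model. This is a simplified sketch, not Kafka's implementation: real brokers compact and delete whole log segments rather than individual records, and the `Record`, `Compact`, and `DeleteOlderThan` names are hypothetical.

```go
package main

import "fmt"

// Record is a simplified Kafka record. TimestampMs is a Unix time in milliseconds.
type Record struct {
	Key         string
	Value       string
	TimestampMs int64
}

// Compact keeps only the newest record per key, as cleanup.policy: compact does.
// Record age is never considered, so a key written once is kept forever.
func Compact(records []Record) []Record {
	newest := make(map[string]int, len(records)) // key -> index of newest record
	for i, r := range records {
		newest[r.Key] = i
	}
	out := make([]Record, 0, len(newest))
	for i, r := range records {
		if newest[r.Key] == i {
			out = append(out, r)
		}
	}
	return out
}

// DeleteOlderThan drops records whose age exceeds retentionMs, modelling the
// time-based part of cleanup.policy: delete.
func DeleteOlderThan(records []Record, nowMs, retentionMs int64) []Record {
	out := make([]Record, 0, len(records))
	for _, r := range records {
		if nowMs-r.TimestampMs <= retentionMs {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	const hourMs = int64(3600 * 1000)
	nowMs := 100 * hourMs
	records := []Record{
		{Key: "cluster-a", Value: "v1", TimestampMs: nowMs - 72*hourMs},
		{Key: "cluster-b", Value: "v1", TimestampMs: nowMs - 48*hourMs},
		{Key: "cluster-a", Value: "v2", TimestampMs: nowMs - 1*hourMs},
	}
	compacted := Compact(records)
	// Compaction alone keeps the 48-hour-old cluster-b record.
	fmt.Println("after compact:", len(compacted))
	// Adding a 24-hour time-based delete removes it.
	fmt.Println("after compact+delete:", len(DeleteOlderThan(compacted, nowMs, 24*hourMs)))
}
```

In this model the number of records under compaction alone is bounded only by the number of distinct keys, which matches the unbounded growth described above when keys are not reused.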

      This issue was raised in a Slack thread: https://redhat-internal.slack.com/archives/C03A998ETHR/p1764637464818859

      Current Configuration

      Kafka has two configuration levels:

      1. Broker Level (global defaults for all topics)

      • log.retention.ms (unset by default; Kafka falls back to log.retention.hours = 168, i.e. 7 days)
      • log.retention.bytes (Kafka default: unlimited)
      • log.cleanup.policy (Kafka default: delete)
      • Location: operator/pkg/controllers/transporter/protocol/strimzi_transporter.go:802-819
      • Currently: Only replication settings configured, no retention parameters set (uses Kafka defaults)

      2. Topic Level (overrides broker defaults)

      • Location: operator/pkg/controllers/transporter/protocol/strimzi_transporter.go:635-637
      • Currently: cleanup.policy: compact only
      • No retention.ms configured
      • No retention.bytes configured
      • Topic config overrides broker defaults

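      The Issue: Although the Kafka broker defaults to 7-day retention, time-based retention is only applied when the effective cleanup policy includes delete. The topic-level cleanup.policy: compact (without delete) overrides the broker's default policy, so the broker's retention settings have no effect and data is kept indefinitely.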
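The override behaviour described above can be sketched as a simple overlay of topic-level settings on broker defaults. This is an illustrative model only: the function names are hypothetical, and the broker keys are written in their topic-level form (for example, `log.cleanup.policy` appears as `cleanup.policy`).

```go
package main

import (
	"fmt"
	"strings"
)

// EffectiveTopicConfig overlays topic-level overrides on broker defaults.
// Any key set at topic level wins; everything else is inherited.
func EffectiveTopicConfig(brokerDefaults, topicOverrides map[string]string) map[string]string {
	eff := make(map[string]string, len(brokerDefaults)+len(topicOverrides))
	for k, v := range brokerDefaults {
		eff[k] = v
	}
	for k, v := range topicOverrides {
		eff[k] = v
	}
	return eff
}

// DeletePolicyEnabled reports whether cleanup.policy includes "delete".
// Without it, retention.ms and retention.bytes do not remove any data.
func DeletePolicyEnabled(cfg map[string]string) bool {
	for _, p := range strings.Split(cfg["cleanup.policy"], ",") {
		if strings.TrimSpace(p) == "delete" {
			return true
		}
	}
	return false
}

func main() {
	// Broker defaults as listed in this issue (7 days = 604800000 ms).
	broker := map[string]string{
		"cleanup.policy":  "delete",
		"retention.ms":    "604800000",
		"retention.bytes": "-1",
	}
	current := EffectiveTopicConfig(broker, map[string]string{"cleanup.policy": "compact"})
	proposed := EffectiveTopicConfig(broker, map[string]string{
		"cleanup.policy": "compact,delete",
		"retention.ms":   "86400000",
	})
	fmt.Println("current: retention.ms =", current["retention.ms"], "enforced =", DeletePolicyEnabled(current))
	fmt.Println("proposed: retention.ms =", proposed["retention.ms"], "enforced =", DeletePolicyEnabled(proposed))
}
```

In the current configuration the inherited 7-day value is still present in the effective config, but it is never enforced because the policy does not include delete.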

      Impact

      • During scale testing (runs of under 5 hours), every event remains in Kafka for the whole run and beyond, because nothing is ever aged out
      • Observed memory growth in MCGH namespace (Operator/Agent pods) is related to processing continuously growing Kafka data
      • Without cleanup mechanisms, long-running production clusters will face storage exhaustion and performance degradation
      • Kafka's default 7-day retention doesn't apply because topic-level cleanup.policy doesn't include delete

      Proposed Solution

      Add explicit retention policy at the topic level:

      cleanup.policy: compact,delete
      retention.ms: 86400000 (24 hours, should be configurable)
      retention.bytes: 1073741824 (1 GiB per partition, optional)
      

      Alternatively, retention could be configured at the broker level if all topics should share the same retention policy.
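As a rough sketch of how the proposed values could be produced in Go, the helper below builds a topic config map from a duration and a byte limit. `TopicRetentionConfig` and the two constants are hypothetical names for illustration; this is not the existing code in strimzi_transporter.go.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

const (
	// DefaultRetention is the 24-hour value proposed above (assumed default).
	DefaultRetention = 24 * time.Hour
	// DefaultRetentionBytes is 1 GiB per partition (assumed default).
	DefaultRetentionBytes int64 = 1 << 30
)

// TopicRetentionConfig returns topic-level settings that combine compaction
// with time-based deletion. retention.bytes is added only when the limit is
// positive, since the proposal treats it as optional.
func TopicRetentionConfig(retention time.Duration, retentionBytes int64) map[string]string {
	cfg := map[string]string{
		"cleanup.policy": "compact,delete",
		"retention.ms":   strconv.FormatInt(retention.Milliseconds(), 10),
	}
	if retentionBytes > 0 {
		cfg["retention.bytes"] = strconv.FormatInt(retentionBytes, 10)
	}
	return cfg
}

func main() {
	cfg := TopicRetentionConfig(DefaultRetention, DefaultRetentionBytes)
	fmt.Println(cfg["cleanup.policy"])  // compact,delete
	fmt.Println(cfg["retention.ms"])    // 86400000
	fmt.Println(cfg["retention.bytes"]) // 1073741824
}
```

Deriving the millisecond and byte values from typed constants avoids hand-written literals such as 86400000, which are easy to mistype.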

      Benefits:

      • Data automatically deleted after retention period
      • Compaction still works for deduplication efficiency
      • Predictable and bounded storage usage
      • Prevents memory growth issues related to unlimited Kafka data accumulation

      Acceptance Criteria

      • Decide whether to configure retention at broker level or topic level (or both with topic overrides)
      • Kafka topics configured with both compact and delete cleanup policies
      • Default retention time set (recommended: 24-48 hours based on use case)
      • Retention parameters (time and bytes) configurable via MulticlusterGlobalHub CR spec
      • Configuration validated in scale/longevity testing showing bounded memory growth
      • Documentation updated explaining:
        • Broker vs topic level retention configuration
        • Default retention policies
        • How to customize retention settings
      • Upgrade path tested for existing deployments
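One possible way to handle a user-supplied retention value is to default it when empty and reject invalid input before it reaches the Kafka topic config. The sketch below assumes the value arrives as a Go duration string such as "48h"; the field format and the `ResolveRetention` name are hypothetical, since the actual MulticlusterGlobalHub CR spec for retention has not been defined yet.

```go
package main

import (
	"fmt"
	"time"
)

// defaultRetention is the assumed default when the user sets nothing.
const defaultRetention = 24 * time.Hour

// ResolveRetention turns a user-supplied duration string into a validated
// retention period. An empty string selects the default. Values below one
// millisecond are rejected because retention.ms is expressed in milliseconds.
func ResolveRetention(raw string) (time.Duration, error) {
	if raw == "" {
		return defaultRetention, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid retention %q: %w", raw, err)
	}
	if d < time.Millisecond {
		return 0, fmt.Errorf("retention %q must be at least 1ms", raw)
	}
	return d, nil
}

func main() {
	for _, raw := range []string{"", "48h", "-1h", "soon"} {
		d, err := ResolveRetention(raw)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%q -> retention.ms=%d\n", raw, d.Milliseconds())
	}
}
```

Validating at this layer keeps a malformed CR value from producing a topic with no effective retention, which matters for the upgrade path of existing deployments.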

      Additional Context

      This issue is related to investigating memory growth observed during scale testing in the MCGH namespace. The current configuration with cleanup.policy: compact only means that even Kafka's default 7-day retention is not enforced, leading to indefinite data retention.

              rh-ee-myan Meng Yan
              rh-ee-myan Meng Yan
              Yaheng Liu Yaheng Liu