  1. Red Hat Advanced Cluster Management
  2. ACM-27032

As a Global Hub admin, I want to configure Kafka message retention policy to prevent unlimited data accumulation

      User Story

      As a Global Hub administrator, I want to configure explicit Kafka message retention policies so that old events are automatically cleaned up, preventing unlimited storage growth and associated memory issues.

      Problem

      Currently, the built-in Kafka topics are configured with only cleanup.policy: compact without explicit retention policies. This causes:
      - Events to be kept indefinitely without automatic time-based or size-based deletion
      - The compact policy only removes older records that share a key with a newer record; the latest record for each key is kept regardless of age or size
      - All events accumulate in Kafka without bounds, contributing to memory growth in long-running clusters

      Current Configuration

      Kafka has two configuration levels:

      1. Broker Level (global defaults for all topics):
        • log.retention.ms (unset by default; the effective Kafka default is 7 days via log.retention.hours=168)
        • log.retention.bytes (Kafka default: unlimited)
        • log.cleanup.policy (Kafka default: delete)
        • Location: operator/pkg/controllers/transporter/protocol/strimzi_transporter.go:802-819
        • Currently: Only replication settings configured, no retention parameters set (uses Kafka defaults)
      2. Topic Level (overrides broker defaults):
        • Location: operator/pkg/controllers/transporter/protocol/strimzi_transporter.go:635-637
        • Currently: cleanup.policy: compact only
        • No retention.ms configured
        • No retention.bytes configured
        • Topic config overrides broker defaults

      The Issue: Even though the Kafka broker defaults to 7-day retention, the topic-level cleanup.policy: compact (without delete) overrides the broker's cleanup policy, so time-based and size-based retention are never applied to these topics. Data is kept indefinitely.
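
The override behavior described above can be sketched in Go. This is an illustrative model only, not code from strimzi_transporter.go; the helper names effectiveConfig and deletesOldData are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// effectiveConfig models how a topic setting is resolved: a topic-level
// value, when present, overrides the broker-level default.
// (Hypothetical helper for illustration.)
func effectiveConfig(broker, topic map[string]string, topicKey, brokerKey string) string {
	if v, ok := topic[topicKey]; ok {
		return v
	}
	return broker[brokerKey]
}

// deletesOldData reports whether time/size-based retention is enforced,
// which is only the case when the cleanup policy includes "delete".
func deletesOldData(cleanupPolicy string) bool {
	for _, p := range strings.Split(cleanupPolicy, ",") {
		if strings.TrimSpace(p) == "delete" {
			return true
		}
	}
	return false
}

func main() {
	// Broker defaults as shipped by Kafka; topic config as currently set.
	broker := map[string]string{"log.cleanup.policy": "delete", "log.retention.hours": "168"}
	topic := map[string]string{"cleanup.policy": "compact"}

	policy := effectiveConfig(broker, topic, "cleanup.policy", "log.cleanup.policy")
	fmt.Println(policy, deletesOldData(policy)) // prints "compact false"
}
```

With the current topic config, the resolved policy is compact, so the broker's 7-day default never triggers deletion.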

      Impact

      • During scale testing ( hours), all events remain in Kafka indefinitely
      • Observed memory growth in MCGH namespace (Operator/Agent pods) is related to processing continuously growing Kafka data
      • Without cleanup mechanisms, long-running production clusters will face storage exhaustion and performance degradation
      • Kafka default 7-day retention does not apply because topic-level cleanup.policy does not include delete

      Proposed Solution

      Add explicit retention policy at the topic level:
      - cleanup.policy: compact,delete
      - retention.ms: 86400000 (24 hours, should be configurable)
      - retention.bytes: 1073741824 (1 GiB per partition, optional)

      Alternatively, retention could be configured at the broker level if all topics should share the same retention policy.

      Benefits:
      - Data automatically deleted after retention period
      - Compaction still works for deduplication efficiency
      - Predictable and bounded storage usage
      - Prevents memory growth issues related to unlimited Kafka data accumulation
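
To illustrate what "bounded storage" could mean in numbers, the sketch below gives a rough upper estimate of disk usage for one topic. It assumes that retention is applied to whole log segments, so each partition can hold roughly retention.bytes plus one segment. The partition count, replica count, and segment size are illustrative inputs, not values taken from the Global Hub configuration.

```go
package main

import "fmt"

// approxMaxTopicBytes returns a rough upper estimate of the disk used by
// one topic across all replicas, assuming each partition retains at most
// retentionBytes plus one segment of segmentBytes.
// (Hypothetical helper; an estimate, not an exact Kafka guarantee.)
func approxMaxTopicBytes(partitions, replicas int, retentionBytes, segmentBytes int64) int64 {
	perPartition := retentionBytes + segmentBytes
	return int64(partitions) * int64(replicas) * perPartition
}

func main() {
	const gib = int64(1) << 30
	// Example: 1 partition, 3 replicas, 1 GiB retention, 1 GiB segments.
	total := approxMaxTopicBytes(1, 3, gib, gib)
	fmt.Printf("%d GiB\n", total/gib) // prints "6 GiB"
}
```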

      Acceptance Criteria

      • Decide whether to configure retention at broker level or topic level (or both with topic overrides)
      • Kafka topics configured with both compact and delete cleanup policies
      • Default retention time set (recommended: 24-48 hours based on use case)
      • Retention parameters (time and bytes) configurable via MulticlusterGlobalHub CR spec
      • Configuration validated in scale/longevity testing showing bounded memory growth
      • Documentation updated explaining broker vs topic level retention configuration, default retention policies, and how to customize retention settings
      • Upgrade path tested for existing deployments

      Additional Context

      This issue came out of investigating memory growth observed in the MCGH namespace during scale testing. Because the current topic configuration sets cleanup.policy: compact only, even the Kafka default 7-day retention is not enforced, which leads to indefinite data retention.

      Generated with Claude Code https://claude.com/claude-code

              clyang82 Chunlin Yang
              rh-ee-myan Meng Yan
              Yaheng Liu Yaheng Liu