Type: Story
Resolution: Unresolved
Priority: Major
Fix Version: Global Hub 1.7.0
Work Type: Product / Portfolio Work
User Story
As a Global Hub administrator, I want to configure explicit Kafka message retention policies so that old events are automatically cleaned up, preventing unlimited storage growth and associated memory issues.
Problem
Currently, the built-in Kafka topics are configured with only cleanup.policy: compact without explicit retention policies. This causes:
- Events to be kept indefinitely without automatic time-based or size-based deletion
- The compact policy removes only superseded records for the same key; it never deletes data based on age or size
- All events accumulate in Kafka without bounds, contributing to memory growth in long-running clusters
Issue raised in this Slack message: https://redhat-internal.slack.com/archives/C03A998ETHR/p1764637464818859
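For reference, here is a minimal sketch of what the current topic-level configuration amounts to, expressed as a Strimzi KafkaTopic (the topic name, cluster label, and partition/replica counts here are illustrative, not the operator's actual values):

  apiVersion: kafka.strimzi.io/v1beta2
  kind: KafkaTopic
  metadata:
    name: gh-status                  # illustrative topic name
    labels:
      strimzi.io/cluster: kafka      # illustrative cluster name
  spec:
    partitions: 1
    replicas: 3
    config:
      cleanup.policy: compact        # compaction only: superseded values per key are
                                     # pruned, but no segment is ever deleted by age or size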
Current Configuration
Kafka has two configuration levels:
1. Broker Level (global defaults for all topics)
- log.retention.ms (Kafka default: 7 days)
- log.retention.bytes (Kafka default: unlimited)
- log.cleanup.policy (Kafka default: delete)
- Location: operator/pkg/controllers/transporter/protocol/strimzi_transporter.go:802-819
- Currently: Only replication settings configured, no retention parameters set (uses Kafka defaults)
2. Topic Level (overrides broker defaults)
- Location: operator/pkg/controllers/transporter/protocol/strimzi_transporter.go:635-637
- Currently: cleanup.policy: compact only
- No retention.ms configured
- No retention.bytes configured
- Topic config overrides broker defaults
The Issue: Even though the Kafka broker has a default 7-day retention, the topic-level cleanup.policy: compact (without delete) makes the broker's retention settings ineffective, so data is kept indefinitely.
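For orientation, broker-level defaults would live in the Strimzi Kafka CR under spec.kafka.config. A fragment sketch, with the Kafka defaults made explicit and names illustrative:

  apiVersion: kafka.strimzi.io/v1beta2
  kind: Kafka
  metadata:
    name: kafka                      # illustrative cluster name
  spec:
    kafka:
      config:
        log.retention.ms: 604800000  # 7 days, the effective Kafka default
        log.retention.bytes: -1      # unlimited, the Kafka default
        log.cleanup.policy: delete   # the Kafka default
      # (listeners, storage, etc. omitted from this fragment)

Note that these broker-level defaults only take effect for topics whose cleanup.policy includes delete, which is exactly why they are ineffective today.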
Impact
- Even in short scale-test runs (<5 hours), no events are deleted from Kafka
- Observed memory growth in the MCGH namespace (Operator/Agent pods) correlates with processing this continuously growing Kafka data
- Without a cleanup mechanism, long-running production clusters will face storage exhaustion and performance degradation
- Kafka's default 7-day retention doesn't apply because the topic-level cleanup.policy doesn't include delete
Proposed Solution
Add explicit retention policy at the topic level:
  cleanup.policy: compact,delete
  retention.ms: 86400000        (24 hours; should be configurable)
  retention.bytes: 1073741824   (1 GB per partition; optional)
Alternatively, retention could be configured at the broker level if all topics should share the same policy.
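A minimal sketch of the proposed topic-level change, again with illustrative topic and cluster names:

  apiVersion: kafka.strimzi.io/v1beta2
  kind: KafkaTopic
  metadata:
    name: gh-status                    # illustrative topic name
    labels:
      strimzi.io/cluster: kafka        # illustrative cluster name
  spec:
    config:
      cleanup.policy: compact,delete   # compaction plus age/size-based deletion
      retention.ms: 86400000           # 24 hours; proposed default, should be configurable
      retention.bytes: 1073741824      # 1 GB per partition; optional upper bound

With compact,delete, Kafka continues to compact per key and additionally deletes log segments once retention.ms or retention.bytes is exceeded, which makes the retention machinery effective again.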
Benefits:
- Data automatically deleted after retention period
- Compaction still works for deduplication efficiency
- Predictable and bounded storage usage
- Prevents memory growth issues related to unlimited Kafka data accumulation
Acceptance Criteria
- Decide whether to configure retention at broker level or topic level (or both with topic overrides)
- Kafka topics configured with both compact and delete cleanup policies
- Default retention time set (recommended: 24-48 hours based on use case)
- Retention parameters (time and bytes) configurable via the MulticlusterGlobalHub CR spec (one possible shape is sketched after this list)
- Configuration validated in scale/longevity testing showing bounded memory growth
- Documentation updated explaining:
- Broker vs topic level retention configuration
- Default retention policies
- How to customize retention settings
- Upgrade path tested for existing deployments
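To make the configurability criterion concrete, here is one possible shape for the MulticlusterGlobalHub spec; the kafkaRetention block and its field names are hypothetical, purely for illustration, and do not describe an existing API:

  apiVersion: operator.open-cluster-management.io/v1alpha4
  kind: MulticlusterGlobalHub
  metadata:
    name: multiclusterglobalhub
  spec:
    kafkaRetention:                # hypothetical field, for illustration only
      retentionMs: 86400000        # would map to topic-level retention.ms
      retentionBytes: 1073741824   # would map to topic-level retention.bytes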
Additional Context
This issue is related to the investigation of memory growth observed during scale testing in the MCGH namespace. With only cleanup.policy: compact configured, even Kafka's default 7-day retention is not enforced, leading to indefinite data retention.