Enhancement
Resolution: Unresolved
Major
Feature request or enhancement
Introduce a new configuration option to the PostgreSQL connector to allow users to treat the Replication Slot as the durable source of truth.
Which use case/requirement will be addressed by the proposed feature?
When the Debezium PostgreSQL connector starts, it compares the LSN recorded in its offset store with the LSN reported by the replication slot. Currently, if the stored offset is behind the slot (offset_lsn < slot_lsn), Debezium enforces a "Fail Fast" policy, crashing with the error: "Saved offset is before replication slot's confirmed lsn".
While this protects against data loss from re-created slots, it also forces operators to perform a full re-snapshot of the database to recover from benign scenarios where the mismatch is due to a deliberate intervention.
Treating the Slot as the Durable Source of Truth:
Kafka's auto.offset.reset configuration lets consumers opt in to trusting the broker's position when their local state is invalid. In the same way, Debezium should let users opt in to trusting the PostgreSQL replication slot's position. If the connector's offset store is stale, the connector could then "jump ahead" to the slot's position rather than failing.
Key Use Cases:
- Respecting Manual Intervention: If an operator manually advances the slot (via pg_replication_slot_advance) to skip corrupted WAL, it should be possible to configure Debezium to respect this change instead of refusing to start. At Zalando, we use the ephemeral MemoryOffsetBackingStore to do just this.
- Recovering from Unmonitored WAL Advancement: In idle scenarios, DBZ-9641 lets users opt in to allowing non-monitored events to advance the replication slot beyond the connector's offset. Users of a durable OffsetBackingStore who want to use this feature will need a way to accept the new slot position without a hard failure.
Implementation ideas (optional)
Deprecate the boolean slot.seek.to.known.offset.on.start and introduce the enum offset.mismatch.strategy. This defines behavior during startStreaming when offset_lsn and slot_lsn differ.
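One possible way to handle the deprecation is to map the old boolean onto the new enum whenever the new option is not set. The following is a minimal sketch under several assumptions: the class and method names are hypothetical, a plain Map stands in for Debezium's own configuration classes, and both the precedence rule (the new option wins) and the mapping of false to FAIL are illustrative guesses rather than agreed behavior.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical enum mirroring the proposed offset.mismatch.strategy values.
enum OffsetMismatchStrategy { FAIL, TRUST_OFFSET, TRUST_SLOT, TRUST_GREATER_LSN }

// Hypothetical helper for illustration only; not part of Debezium.
final class MismatchStrategyConfig {
    static final String NEW_OPTION = "offset.mismatch.strategy";
    static final String DEPRECATED_OPTION = "slot.seek.to.known.offset.on.start";

    private MismatchStrategyConfig() {
    }

    static OffsetMismatchStrategy resolve(Map<String, String> config) {
        String explicit = config.get(NEW_OPTION);
        if (explicit != null) {
            // Assumption: the new option takes precedence when both are set.
            return OffsetMismatchStrategy.valueOf(explicit.trim().toUpperCase(Locale.ROOT));
        }
        // Assumption: the deprecated boolean maps to TRUST_OFFSET when true
        // and to FAIL when false or absent.
        boolean seek = Boolean.parseBoolean(config.get(DEPRECATED_OPTION));
        return seek ? OffsetMismatchStrategy.TRUST_OFFSET : OffsetMismatchStrategy.FAIL;
    }
}
```

With this mapping, existing configurations that only set the deprecated boolean would keep working, while new configurations can select any of the four strategies.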
Proposed Enum Values:
- FAIL (Default)
  - Behavior: Throw an exception if LSNs do not match.
  - Rationale: Preserves current safety. Forces human intervention to investigate potential data loss (Parallels Kafka auto.offset.reset = none).
- TRUST_OFFSET
  - Behavior: If offset_lsn > slot_lsn, advance the Slot to the offset LSN. If offset_lsn < slot_lsn, fail.
  - Rationale: Replaces slot.seek.to.known.offset.on.start = true. Prioritizes "At-Least-Once" delivery to prevent re-streaming duplicates after a crash where the offset was committed but the slot flush failed.
- TRUST_SLOT
  - Behavior: If offset_lsn < slot_lsn, advance the Connector Offset ("jump ahead") to the slot's LSN. If offset_lsn > slot_lsn, fail.
  - Rationale: Solves the "Hard Reset" problem. Allows recovery from WAL advancement by treating the database slot as the source of truth.
- TRUST_GREATER_LSN
  - Behavior: Automatically synchronize to max(offset_lsn, slot_lsn).
  - Rationale: A "self-healing" mode for advanced users who want to recover from both crash-before-flush AND deliberate wal-advance scenarios automatically.
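The four behaviors above can be summarized as a single decision function. This is only a sketch of the proposed semantics, not existing Debezium code: the StartLsnResolver class and resolveStartLsn method are hypothetical, and LSNs are modelled as raw long values (compared as unsigned 64-bit numbers) instead of Debezium's own LSN type. Actually advancing the slot or rewriting the stored offset is out of scope here; the function only decides which LSN streaming would start from.

```java
// Hypothetical enum mirroring the proposed offset.mismatch.strategy values.
enum OffsetMismatchStrategy { FAIL, TRUST_OFFSET, TRUST_SLOT, TRUST_GREATER_LSN }

// Hypothetical helper for illustration only; not part of Debezium.
final class StartLsnResolver {
    private StartLsnResolver() {
    }

    // Returns the LSN to start streaming from, or throws if the chosen
    // strategy does not permit the observed mismatch.
    static long resolveStartLsn(OffsetMismatchStrategy strategy, long offsetLsn, long slotLsn) {
        // PostgreSQL LSNs are unsigned 64-bit positions, so compare as unsigned.
        int cmp = Long.compareUnsigned(offsetLsn, slotLsn);
        if (cmp == 0) {
            return offsetLsn; // No mismatch: every strategy agrees.
        }
        boolean offsetAhead = cmp > 0;
        switch (strategy) {
            case TRUST_OFFSET:
                if (offsetAhead) {
                    return offsetLsn; // Slot would be advanced to the offset.
                }
                break;
            case TRUST_SLOT:
                if (!offsetAhead) {
                    return slotLsn; // Connector offset "jumps ahead" to the slot.
                }
                break;
            case TRUST_GREATER_LSN:
                return offsetAhead ? offsetLsn : slotLsn;
            case FAIL:
            default:
                break;
        }
        throw new IllegalStateException("LSN mismatch not allowed by strategy " + strategy
                + ": offset_lsn=" + Long.toUnsignedString(offsetLsn)
                + ", slot_lsn=" + Long.toUnsignedString(slotLsn));
    }
}
```

Keeping the decision in a pure function like this would make each strategy easy to unit test independently of a live replication connection.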
Code Location: Minimal changes to the logic in PostgresReplicationConnection#startStreaming, which determines the starting LSN before the createReplicationStream call.