Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: under-triaging
Affects Version/s: 2.3.5.Final
Component/s: postgresql-connector
Labels:
- silent-data-loss

Blocked:
False
Blocked Reason:
None
Ready:
False

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

In order to make your issue reports as actionable as possible, please provide the following information, depending on the issue type.

Bug report

For bug reports, provide this information, please:

What Debezium connector do you use and what version?

Debezium postgres v2.3.5.Final

What is the connector configuration?

"connector.class": "io.debezium.connector.postgresql.PostgresConnector",

"snapshot.mode": "never",

"plugin.name": "pgoutput",

"producer.override.compression.type": "gzip",

"producer.override.batch.size": "65536",

"producer.override.max.request.size": "10485760",

"producer.override.acks": "all",

"max.queue.size": "81290",

"max.batch.size": "20480",

"heartbeat.interval.ms": "1000",

"heartbeat.action.query": "<>;",

"tombstones.on.delete": "false",

"decimal.handling.mode": "double",

"time.precision.mode": "connect",

"interval.handling.mode": "string",

"topic.prefix": "<some-prefix>",

What is the captured database version and mode of deployment?

postgres-13, AWS RDS

Background

In order to boost confidence in debezium connector's data capture, we use a predictive method by monitoring DB operations on a specific table and verify alignment with the destination topic.

We use a simple a heartbeat table

CREATE TABLE audit_heartbeat (

id BIGINT PRIMARY KEY, ## primary key

ts_ms BIGINT NOT NULL, ## – Timestamp in milliseconds

);

where we insert records at regular intervals (e.g., every 1s) and including it in the connector's table.include.list ensures these records are captured in Kafka. Monitoring their sequence helps detect gaps or data loss in the capture process.

What behavior do you expect?

On connector restart, Debezium should resume from the last processed LSN, ensuring no events are skipped to prevent data loss.

It's acceptable for Debezium to reprocess events from the 'last processed LSN,' which may result in duplicates, but the primary focus must be on ensuring that no events are lost.

What behavior do you see?

On connector restart, we noticed that sequence 808024 was lost after restart

{"after":

{"id":"808023","tsMs":"1728097285846"}

,"source":

{"version":"2.3.5.Final","connector":"postgresql","name":"<redacted>","tsMs":"1728097285846","snapshot":"false","db":"<redacted>","schema":"public","table":"audit_heartbeat","txId":"3712864487","lsn":"354650320202872","xmin":"0"}

,"op":"c","tsMs":"1728097286246"}

{"after":

{"id":"808025","tsMs":"1728097287653"}

,"source":

{"version":"2.3.5.Final","connector":"postgresql","name":"<redacted>","tsMs":"1728097287653","snapshot":"false","db":"<redacted>","schema":"public","table":"audit_heartbeat","txId":"3712864501","lsn":"354650320282016","xmin":"0"}

,"op":"c","tsMs":"1728097290303"}

Do you see the same behaviour using the latest released Debezium version?

(Ideally, also verify with latest Alpha/Beta/CR version)

Did not test

Do you have the connector logs, ideally from start till finish?

(You might be asked later to provide DEBUG/TRACE level log)

Offset topic entry

{"transaction_id":null,"lsn_proc":354650320202872,"messageType":"INSERT","lsn_commit":354650320199344,"lsn":354650320202872,"txId":3712864487,"ts_usec":1728097285846965}

Note that

Lsn = 354650320202872 == 1428D/765AEC78 (sequence 808023)

Connector Logs:

[2024-10-05 03:01:30,047] INFO Starting streaming (io.debezium.pipeline.ChangeEventSourceCoordinator)

[2024-10-05 03:01:30,047] INFO Retrieved latest position from stored offset 'LSN{1428D/765AEC78}' (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource)

[2024-10-05 03:01:30,048] INFO Looking for WAL restart position for last commit LSN 'LSN{1428D/765ADEB0}' and last change LSN 'LSN{1428D/765AEC78}' (io.debezium.connector.postgresql.connection.WalPositionLocator)

[2024-10-05 03:01:30,048] INFO Initializing PgOutput logical decoder publication (io.debezium.connector.postgresql.connection.PostgresReplicationConnection)

[2024-10-05 03:01:30,114] INFO Obtained valid replication slot ReplicationSlot [active=false, latestFlushedLsn=LSN\\\{1428D/765ADEB0}, catalogXmin=3712864412] (io.debezium.connector.postgresql.connection.PostgresConnection)

[2024-10-05 03:01:30,117] INFO Connection gracefully closed (io.debezium.jdbc.JdbcConnection)

[2024-10-05 03:01:30,119] INFO Seeking to LSN{1428D/765ADEB0} on the replication slot with command SELECT pg_replication_slot_advance(<redacted>, '1428D/765ADEB0') (io.debezium.connector.postgresql.connection.PostgresReplicationConnection)

[2024-10-05 03:01:30,139] INFO Requested thread factory for connector PostgresConnector, id = <redacted> named = keep-alive (io.debezium.util.Threads)

[2024-10-05 03:01:30,139] INFO Creating thread debezium-postgresconnector-<redacted>-keep-alive (io.debezium.util.Threads)

[2024-10-05 03:01:30,180] INFO Searching for WAL resume position (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource)

[2024-10-05 03:01:30,182] INFO First LSN 'LSN{1428D/765AEC40}' received (io.debezium.connector.postgresql.connection.WalPositionLocator)

[2024-10-05 03:01:30,194] INFO LSN after last stored change LSN 'LSN{1428D/765AED30}' received (io.debezium.connector.postgresql.connection.WalPositionLocator)

[2024-10-05 03:01:30,194] INFO WAL resume position 'LSN{1428D/765AED30}' discovered (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource)

[2024-10-05 03:01:30,196] INFO Connection gracefully closed (io.debezium.jdbc.JdbcConnection)

[2024-10-05 03:01:30,198] INFO Connection gracefully closed (io.debezium.jdbc.JdbcConnection)

[2024-10-05 03:01:30,211] INFO Initializing PgOutput logical decoder publication (io.debezium.connector.postgresql.connection.PostgresReplicationConnection)

[2024-10-05 03:01:30,213] INFO Seeking to LSN{1428D/765ADEB0} on the replication slot with command SELECT pg_replication_slot_advance(<redacted>, '1428D/765ADEB0') (io.debezium.connector.postgresql.connection.PostgresReplicationConnection)

[2024-10-05 03:01:30,225] INFO Requested thread factory for connector PostgresConnector, id = <redacted> named = keep-alive (io.debezium.util.Threads)

[2024-10-05 03:01:30,225] INFO Creating thread debezium-postgresconnector-<redacted>-keep-alive (io.debezium.util.Threads)

[2024-10-05 03:01:30,225] INFO Processing messages (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource)

[2024-10-05 03:01:30,227] INFO Streaming requested from LSN LSN{1428D/765AEC78}, received LSN LSN{1428D/765AEC40} identified as already processed (io.debezium.connector.postgresql.connection.AbstractMessageDecoder)

[2024-10-05 03:01:30,247] INFO Message with LSN 'LSN{1428D/765AED30}' arrived, switching off the filtering (io.debezium.connector.postgresql.connection.WalPositionLocator)

How to reproduce the issue using our tutorial deployment?

Feature request or enhancement

For feature requests or enhancements, provide this information, please:

Which use case/requirement will be addressed by the proposed feature?

Prevent silent data loss

Implementation ideas (optional)

Introduce a connector config param to optionally disable `skipMessage` in WalPositionLocator
Update resumeFromLsn in WalPositionLocator to ensure that it always exactly resumes from lastChangeLsn even if it means reprocessing previously processed lsn to prevent data loss

Details

Description