[DBZ-1766] Postgres Connector losing data on restart due to commit() being called before events produced to Kafka

      We've run into an issue where events are lost when replicating from Postgres to Kafka while Kafka Connect restarts. This happens because a `commit()` call is made before the events returned from `poll()` are actually produced to Kafka. When Kafka Connect is terminated before some of those events are produced, on restart it picks up from the committed offset, which leads to data loss.

      The underlying out-of-order callback behavior is tracked as an open issue against Kafka Connect: https://issues.apache.org/jira/browse/KAFKA-5716
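To make the hazard concrete, here is a minimal, self-contained sketch of the ordering problem (all names are hypothetical; this is not the real Kafka Connect or Debezium API). The point is that if `commit()` flushes the LSN of the newest *polled* record rather than the newest *acknowledged* record, a crash between the two loses the in-flight events:

```java
public class LostEventDemo {
    static long confirmedLsn = 0;           // slot position Postgres replays from on restart

    // Buggy strategy: on commit(), flush the LSN of the newest polled record,
    // even though some of those records may not have been produced yet.
    static void buggyCommit(long lastPolledLsn) { confirmedLsn = lastPolledLsn; }

    // Safe strategy: flush only the newest LSN whose record Kafka has acknowledged.
    static void safeCommit(long lastAckedLsn) { confirmedLsn = lastAckedLsn; }

    public static void main(String[] args) {
        long[] polledLsns = {10, 20, 30};   // change events handed to Connect by poll()
        long lastAcked = 10;                // only the first has been produced so far

        buggyCommit(polledLsns[2]);
        // Connect is killed here: events 20 and 30 were never produced,
        // but Postgres will replay only from LSN 30 -> they are lost.
        System.out.println("buggy restart point: " + confirmedLsn);

        safeCommit(lastAcked);
        System.out.println("safe restart point: " + confirmedLsn);
    }
}
```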

      Here's a repo with steps to reproduce the issue: https://github.com/mmarvick-convoy/debezium-restart-issue

      Note that the repro steps use a custom build of Debezium with some additional logging in the commit and poll callbacks. You can edit the Kafka Connect Dockerfile to use the latest stable Debezium 1.0 without the custom build and still reproduce the issue. We've also reproduced it on Debezium 0.9.5, so it's not a recent regression; given the underlying Kafka Connect issue, it has probably existed from the beginning.


            Jiri Pechanec added a comment - Released

            Gunnar Morling added a comment:

            > I think this probably impacts SQL Server and DB2 as well, but I'd prefer one of you add the tag (those ones are out of my area of knowledge)

            Thanks for pointing this out; actually it's not needed there, as for SQL Server and Db2 eventually nothing gets committed/acknowledged with the source database (unlike PG and Oracle). I'll look into removing that superfluous code for these two connectors.

            Michael Marvick (Inactive) added a comment:

            One more thing that we're still investigating: it looks like when Debezium starts up, it first tries to use the offsets that Kafka Connect stored, and then falls back to the LSN flushed to Postgres. We think the Kafka Connect offsets stored in Kafka might also be committed before the records are produced to the topic. We're not 100% sure yet, but if that turns out to be true, then either (a) Kafka Connect might need to be patched, or (b) Debezium might be better off using the offset stored in the Postgres replication slot rather than the one stored by Kafka Connect.

            Michael Marvick (Inactive) added a comment (edited):

            I think this probably impacts SQL Server and DB2 as well, but I'd prefer one of you add the tag (those ones are out of my area of knowledge).

            SQL Server: https://github.com/debezium/debezium/blob/master/debezium-connector-sqlserver/src/main/java/io/debezium/connector/sqlserver/SqlServerConnectorTask.java#L149-L161
            DB2: https://github.com/debezium/debezium-incubator/blob/master/debezium-connector-db2/src/main/java/io/debezium/connector/db2/Db2ConnectorTask.java#L171-L183

            Michael Marvick (Inactive) added a comment (edited):

            Our strategy has been to set a variable for the last processed LSN in `commitRecord`, but to not actually flush the LSN until `commit`. We made a fix against our internal fork of 0.9.5; it passed our repro steps and is working well in prod so far. We'd be happy to put up a PR against master for Postgres with that approach (EDIT: it's here). Here's our fork's PR for reference: https://github.com/convoyinc/debezium/pull/17

            In 0.9.5, the source record doesn't keep track of the last fully processed LSN, just the LSN for the record itself, so we have to do a little more work: track both the previous and next LSN, and only commit the previous one when the next one differs (indicating a whole batch was processed). In the latest code, the source record includes the last fully processed LSN, so this should be easier to patch in master.
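The strategy described above can be sketched roughly as follows (hypothetical names, not Debezium's actual classes): `commitRecord` only remembers the LSN of each record Kafka has acknowledged, and `commit` is what actually flushes that LSN to the replication slot, so the slot position can never move past an unproduced event:

```java
import java.util.concurrent.atomic.AtomicLong;

public class LsnFlushingTask {
    private final AtomicLong lastAckedLsn = new AtomicLong(-1);
    long flushedLsn = -1;   // stands in for the replication slot position

    // Called by Connect after each record is acknowledged by the producer:
    // remember the LSN, but do not flush yet.
    public void commitRecord(long recordLsn) {
        lastAckedLsn.set(recordLsn);
    }

    // Called periodically by Connect; a safe point to flush, because every
    // LSN recorded so far belongs to a record already produced to Kafka.
    public void commit() {
        long lsn = lastAckedLsn.get();
        if (lsn >= 0) {
            flushedLsn = lsn;   // e.g. flush this LSN to the replication slot
        }
    }
}
```

If the task crashes between `commitRecord` and `commit`, the worst case is re-delivering already-produced events, not losing unproduced ones.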

            Jiri Pechanec added a comment:

            Reading the KIP, it seems that the easiest fix is to use `commitRecord` instead of `commit`. The question is the performance implication of flushing for every record, so it might make sense to flush only once per a configurable number of records.
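One way to read this suggestion (a sketch under assumptions, not Debezium's code) is to flush from `commitRecord` itself, but only once every N acknowledged records to bound the per-record flush cost:

```java
public class BatchedFlushTask {
    private final int flushEvery;       // configurable batch size
    private int ackedSinceFlush = 0;
    long flushedLsn = -1;               // stands in for the replication slot position

    public BatchedFlushTask(int flushEvery) {
        this.flushEvery = flushEvery;
    }

    // Called after each record is acknowledged; flushes only every N records.
    public void commitRecord(long recordLsn) {
        if (++ackedSinceFlush >= flushEvery) {
            flushedLsn = recordLsn;     // e.g. flush this LSN to the slot here
            ackedSinceFlush = 0;
        }
    }
}
```

Since `commitRecord` only fires for acknowledged records, every flushed LSN is safe; the trade-off is that up to N-1 acknowledged records may be re-delivered after a crash.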

            Gunnar Morling added a comment - This also affects the Oracle connector in all likelihood.

              Assignee: Unassigned
              Reporter: mmarvick-convoy (Michael Marvick)
              Votes: 0
              Watchers: 5