[DBZ-923] MySQL active-passive: brief data loss on failover when Debezium encounters new GTID channel

Type: Bug
Resolution: Done
Priority: Critical
Fix Version/s: 0.9.0.Beta2
Affects Version/s: 0.8.3.Final
Component/s: mysql-connector
Labels: None

Let's say we have two MySQL servers in a standard active-passive high-availability setup. If the current master node fails, automation promotes the passive instance to be the new master, and it continues to serve live traffic. Debezium connects to the master node as well.

Starting point:
Server A (current master)
uuid: abc
gtids: abc:1-100

Server B (slave)
uuid: dfg
gtids: abc:1-100 (replicating from master)

Debezium also connects to the master, so it has
gtids: abc:1-100

Now assume the master node fails and failover is triggered.

Server B (automation promotes it to new master)
uuid: dfg
gtids: abc:1-100, dfg:1-20

Server A (becomes slave, starts replicating from B)
uuid: abc
gtids: abc:1-100, dfg:1-20

Debezium after job restart:
gtids: abc:1-100, dfg:1-20

Debezium gets a connection reset error; on job restart it successfully connects to the new master (Server B), finds the new GTID channel (dfg), merges it into its existing offsets, and resumes streaming.

This works, but there is a timing issue.

When Debezium encounters a new GTID channel, it starts reading that channel from the MySQL server's latest gtid_executed position. So if the MySQL failover completes faster than Debezium's failure detection and job restart, the live data that arrived at the new master on the new GTID channel (dfg in our example) is never processed by Debezium. In our infrastructure this can mean several minutes of lost data, because Debezium startup takes some time with large schemas.

What do you think about an option to specify what Debezium should do when it encounters a new GTID channel: take the latest executed position and continue from there, or take the earliest available value on the server? The default could remain "latest", but in our case "earliest" would solve our problem of lost data changes on failover. "Earliest" could be the channel's gtid_purged value or, if nothing has been purged, position 1.
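To make the proposal concrete, here is a minimal Python sketch of how the two modes could differ when the stored offsets are merged with the new master's GTID sets. It is an illustration only, not Debezium's actual implementation: the function names (`parse_gtid_set`, `merge_for_new_channels`, `format_gtid_set`) and the simplified GTID parsing are assumptions made for this example.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-100,uuid2:1-20' into {uuid: [(start, end), ...]}."""
    channels = {}
    for part in text.replace(" ", "").split(","):
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = []
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
        channels[uuid] = ranges
    return channels


def format_gtid_set(channels):
    """Inverse of parse_gtid_set, keeping channel insertion order."""
    return ",".join(
        uuid + ":" + ":".join(
            f"{start}-{end}" if start != end else str(start)
            for start, end in ranges
        )
        for uuid, ranges in channels.items()
    )


def merge_for_new_channels(stored, executed, purged, mode="latest"):
    """Hypothetical merge: decide which transactions count as already seen.

    stored   - GTID set from the connector's offsets
    executed - gtid_executed of the server we are connecting to
    purged   - gtid_purged of that server
    """
    if mode not in ("latest", "earliest"):
        raise ValueError(f"unknown mode: {mode}")
    merged = dict(stored)
    for uuid, ranges in executed.items():
        if uuid in stored:
            continue  # known channel: keep the connector's own position
        if mode == "latest":
            # Everything executed so far is treated as seen (this skips data).
            merged[uuid] = ranges
        elif uuid in purged:
            # Earliest: skip only what the server can no longer provide.
            merged[uuid] = purged[uuid]
        # Earliest with nothing purged: leave the channel out, so
        # streaming starts from its first transaction.
    return merged


stored = parse_gtid_set("abc:1-100")
executed = parse_gtid_set("abc:1-100, dfg: 1-20")

latest = format_gtid_set(merge_for_new_channels(stored, executed, {}, "latest"))
earliest = format_gtid_set(merge_for_new_channels(stored, executed, {}, "earliest"))
print(latest)    # abc:1-100,dfg:1-20  (dfg:1-20 is never streamed)
print(earliest)  # abc:1-100           (dfg is streamed from position 1)
```

With the numbers from the example above, "latest" marks dfg:1-20 as already consumed, which is exactly the lost window, while "earliest" leaves the dfg channel out of the merged set so that those 20 transactions are still delivered.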

Screen Shot 2018-09-27 at 22.48.03.png
141 kB
2018/09/27 3:48 PM

is related to

DBZ-1705 Default `gtid.new.channel.position` to earliest

Closed

Jiri Pechanec added a comment - 2018/12/19 9:10 AM

Released

Eero Koplimets (Inactive) added a comment - 2018/12/10 6:36 AM

Created pull request for docs update https://github.com/debezium/debezium.github.io/pull/233

Gunnar Morling added a comment - 2018/12/07 10:11 AM

Thanks for reporting back, pimpelsang. Btw. there's still the doc update for the new option missing. Could you file a PR for adding it to the connector docs? Thanks!

Eero Koplimets (Inactive) added a comment - 2018/11/27 3:37 AM

Just to follow up: this nightly build has now been running in our environment for 4 days without problems.

Eero Koplimets (Inactive) added a comment - 2018/11/22 5:27 AM

Sure, will do.

Gunnar Morling added a comment - 2018/11/22 4:57 AM

pimpelsang, I've added one more commit with some clean-up. Perhaps you can build the connector from source (or wait for today's nightly build) and give it another test run in your environment?

Gunnar Morling added a comment - 2018/11/22 4:55 AM

The code change has been merged, leaving the issue open until there's a docs PR, too.

Gunnar Morling added a comment - 2018/11/21 5:08 PM - edited

Thanks a lot for the thorough analysis and the PR, pimpelsang! There's one potential issue I see with the proposed EARLIEST mode, and that'd be master-master set-ups. In that case the connector would already have streamed some "dfg" changes while reading from server A, so there'd be duplicated events. I reckon there's no way to avoid it, though, and after all, Debezium generally works with "at least once" semantics, so seems acceptable.

Thinking more about this, it should actually be fine. In this case the connector would have committed its offsets while reading from A, so there shouldn't be any more duplication than the usual re-delivery of events emitted after the last offset commit.
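The reasoning above can be sketched in a few lines of Python. This is a simplified illustration, not connector code: each channel is reduced to the highest transaction number of a contiguous range starting at 1, the function name `next_transactions` is hypothetical, and the value 12 for the already-streamed "dfg" position is an assumed number chosen for the example.

```python
def next_transactions(stored, executed):
    """Return, per channel on the server, the first transaction still to stream.

    Both arguments map a server uuid to the highest transaction number in a
    contiguous range starting at 1 (a simplification of real GTID sets).
    A channel missing from `stored` is new and, in "earliest" mode, starts at 1.
    """
    return {uuid: stored.get(uuid, 0) + 1 for uuid in executed}


# Master-master: "dfg" changes were already streamed via server A, so the
# channel is present in the committed offsets and is not treated as new.
master_master = next_transactions({"abc": 100, "dfg": 12}, {"abc": 100, "dfg": 20})
print(master_master)   # {'abc': 101, 'dfg': 13}

# Active-passive failover: "dfg" is unknown to the connector, so "earliest"
# starts it from the first transaction.
active_passive = next_transactions({"abc": 100}, {"abc": 100, "dfg": 20})
print(active_passive)  # {'abc': 101, 'dfg': 1}
```

Because the "earliest" rule only applies to channels absent from the committed offsets, the master-master case resumes after the last committed position rather than replaying the whole channel.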

Eero Koplimets (Inactive) added a comment - 2018/10/03 2:26 AM

This mostly applies to failovers, because on first startup the snapshot uses the current database state. But when you already have offsets, a restored MySQL server may have new GTID channels, and then data is lost. There will certainly be errors if the lost events included ALTER statements.

Jiri Pechanec added a comment - 2018/10/02 1:11 AM

Excellent analysis! A pull request is definitely welcome. Is it only a problem with failover? Couldn't something like this also happen when Debezium fails to start with unfortunate timing?

Assignee: Unassigned
Reporter: Eero Koplimets (Inactive)
Votes: 0
Watchers: 3
Created: 2018/09/27 3:24 AM
Updated: 2020/01/15 7:03 AM
Resolved: 2018/12/12 10:45 AM
