-
Bug
-
Resolution: Done
-
Critical
-
0.8.3.Final
-
None
Let's say we have two MySQL servers in a standard active-passive high-availability setup. If the current master node fails, automation promotes the passive instance to be the new master, and it continues to serve live traffic. Debezium connects to the master node as well.
Starting point:
Server A (current master)
uuid: abc
gtids: abc:1-100
Server B (slave)
uuid: dfg
gtids: abc:1-100 (replicating from master)
Debezium is also connected to the master, so it has
gtids: abc:1-100
Now assume the master node fails and a failover is triggered:
Server B (automation promotes it to new master)
uuid: dfg
gtids: abc:1-100, dfg:1-20
Server A (becomes slave, starts replication from B)
uuid: abc
gtids: abc:1-100, dfg:1-20
Debezium after job restart:
gtids: abc:1-100, dfg:1-20
Debezium gets a connection reset error; on job restart it successfully connects to the new master (Server B), finds the new GTID channel (dfg), merges it into its existing offsets, and continues streaming.
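To make the restart behaviour concrete, here is a minimal Python sketch of the merge as I understand it from the outside. The function names are hypothetical and this is not Debezium's actual code; it only assumes that channels already present in the stored offset are kept as-is and that unknown channels are adopted exactly as the server reports them.

```python
def parse_gtid_set(text):
    """Parse 'abc:1-100,dfg:1-20' into {'abc': [(1, 100)], 'dfg': [(1, 20)]}."""
    result = {}
    for part in text.replace(" ", "").split(","):
        if not part:
            continue  # tolerate a trailing comma
        uuid, *intervals = part.split(":")
        ranges = []
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
        result[uuid] = ranges
    return result


def format_gtid_set(gtids):
    """Inverse of parse_gtid_set (single transactions are printed as 'n-n')."""
    return ",".join(
        uuid + ":" + ":".join(f"{s}-{e}" for s, e in ranges)
        for uuid, ranges in gtids.items()
    )


def merge_on_restart(offset_gtids, server_executed):
    """Keep channels known to the offset; adopt unknown channels as reported by the server."""
    merged = dict(parse_gtid_set(offset_gtids))
    for uuid, ranges in parse_gtid_set(server_executed).items():
        if uuid not in merged:
            merged[uuid] = ranges  # new channel: everything executed so far counts as "seen"
    return format_gtid_set(merged)


# Offset stored before the failover vs. gtid_executed on Server B at restart
print(merge_on_restart("abc:1-100", "abc:1-100, dfg: 1-20"))  # abc:1-100,dfg:1-20
```

The key point is the last branch: the whole executed range of the new channel is marked as already seen, whether or not Debezium ever streamed it.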
This works, BUT there is a timing issue.
When Debezium encounters a new GTID channel, it starts reading that channel from the server's latest gtid_executed position. So when the MySQL failover completes faster than Debezium detects the failure and restarts its job, the live data written to the new master under the new GTID channel (dfg in our example) is never processed by Debezium. In our infrastructure this can mean several minutes of lost data, because Debezium startup takes some time with large schemas.
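The lost window can be expressed as a small calculation. The sketch below is only an illustration with hypothetical helper names; it represents GTID sets as plain dictionaries and reuses the numbers from the example above.

```python
def skipped_on_new_channels(offset, server_executed):
    """Transactions on channels the offset has never seen.

    With the current behaviour these ranges are treated as already
    processed, although Debezium never streamed them.
    """
    return {
        uuid: ranges
        for uuid, ranges in server_executed.items()
        if uuid not in offset
    }


def count_transactions(ranges):
    """Number of transactions covered by a list of inclusive (start, end) ranges."""
    return sum(end - start + 1 for start, end in ranges)


# State from the example: the offset saved before the failover, and
# gtid_executed on Server B at the moment Debezium reconnects.
offset = {"abc": [(1, 100)]}
executed_at_restart = {"abc": [(1, 100)], "dfg": [(1, 20)]}

skipped = skipped_on_new_channels(offset, executed_at_restart)
for uuid, ranges in skipped.items():
    print(uuid, ranges, count_transactions(ranges))  # dfg [(1, 20)] 20
```

The longer the restart takes, the larger the dfg range is when Debezium reconnects, and all of it is skipped.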
What do you think about an option that specifies what Debezium should do when it encounters a new GTID channel: take the latest executed position and continue from there, or take the earliest value available on the server? The default could remain "latest", but in our case "earliest" would solve the problem of lost data changes on failover. "Earliest" could be the channel's gtid_purged value or, if nothing has been purged, position 1.
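A rough sketch of the proposed choice, in Python for brevity. The function and its parameters are hypothetical and only illustrate the two modes named above; `executed` and `purged` stand for the new channel's ranges in gtid_executed and gtid_purged.

```python
def first_txn_to_read(mode, executed, purged=()):
    """First transaction number to stream on a newly discovered channel.

    executed / purged are lists of inclusive (start, end) ranges for that channel.
    """
    if mode == "latest":
        # Current behaviour: continue after everything the server has executed.
        return max(end for _, end in executed) + 1
    if mode == "earliest":
        # Proposed: continue after whatever has been purged, or from 1 if nothing was.
        return max((end for _, end in purged), default=0) + 1
    raise ValueError(f"unknown mode: {mode!r}")


# Channel dfg from the example: 20 transactions executed, nothing purged.
print(first_txn_to_read("latest", [(1, 20)]))              # 21
print(first_txn_to_read("earliest", [(1, 20)]))            # 1
# Same channel if the first five transactions had already been purged.
print(first_txn_to_read("earliest", [(1, 20)], [(1, 5)]))  # 6
```

With "earliest", the failover example would resume at dfg:1 instead of dfg:21, so the transactions written during the restart window would still be streamed.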
- is related to: DBZ-1705 Default `gtid.new.channel.position` to earliest (Closed)