-
Enhancement
-
Resolution: Done
-
Major
-
0.3.4
-
None
We spent a bit more time investigating DBZ-129. We have definitely replicated our issue in DBZ-129 several times on several different versions of MySQL, and so have some of our DBAs (independently).
The observed problem is that MySQL does not allow a replica to start reading from a different master if the new master has a different channel set than the slave (see MySQL Multi-Source Replication). A concrete example would be to set up two masters: m1 and m2, and one slave: s1. Next, make m1 and m2 slaves as well (i.e. a multi-level topology), where m1 has two upstream masters (i.e. two channels), and m2 has only one master. The topology should look like so:
us1 us2 | \ | m1 m2 \ s1
Now, if you change s1's master from m1 to m2, MySQL will be unable to continue, even if the data is on m2. MySQL prints the same error that's described in DBZ-129.
The only way that we were able to make this setup work was to stop the us1 channel for both m1 and m2, and then make sure that the two had the exact same gtid_last_executed value. We synced m1 and m2 by finding the machine that was farther behind, and running: START SLAVE SQL_THREAD UNTIL SQL_AFTER_GTIDS = '<server UUID>:1149-1920' for channel 'us1';. Once both machines are in sync, we changed the s1 channel over to m2, and started both m1 and m2 again.
Obviously this dance doesn't work for all cases in a production failover scenario, so it's not viable. Once we replicated this issue just with MySQL (i.e. s1 is a MySQL slave), and confirmed it was a MySQL server-related issue, we replaced s1 with a Debezium connector. We then tested whether Debezium could do the failover. It fails in exactly the same way, and works when the SQL thread is caught up, as described above.
We then forced debezium to lie to MySQL when it connected. We stopped Debezium temporarily, and manually overwrote the Kafka connect offset commit to force Debezium to have a GTID set that included both GTIDs for us1 and us2 (even though it only cares about data from us1). We then started Debezium up again, pointing to m2 for the us1 channel this time. It appeared to work.
I'd like to add this behavior into Debezium, itself--have it auto-merge whatever GTID set is on the MySQL server that Debezium is connecting to with whatever it's got in its own offset commit. This should allow Debezium to fail over to aribtrary servers, provided that the server has the required data available (regardless of whatever other channels are in the server).
- relates to
-
DBZ-146 NPE in KafkaDatabaseHistory
- Closed