-
Bug
-
Resolution: Done
-
Major
-
2.6.2.Final
-
None
-
False
-
None
-
False
Bug report
Description
When there is an exception during the data capture in the snapshot mode (for instance, table filter expression contains a non-existent table), Debezium will fall in an infinite retry loop.
The logs contain the following:
2024-04-05 14:01:59,826 ERROR [debezium-mysqlconnector-database-change-event-source-coordinator] io.debezium.relational.RelationalSnapshotChangeEventSource: Error during snapshot java.util.concurrent.ExecutionException: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table northwind.orders failed at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) at io.debezium.relational.RelationalSnapshotChangeEventSource.createDataEvents(RelationalSnapshotChangeEventSource.java:507) at io.debezium.relational.RelationalSnapshotChangeEventSource.doExecute(RelationalSnapshotChangeEventSource.java:170) at io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:94) at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:253) at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:237) at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:189) at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:140) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table northwind.orders failed at io.debezium.relational.RelationalSnapshotChangeEventSource.lambda$createDataEventsForTableCallable$7(RelationalSnapshotChangeEventSource.java:566) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ... 5 common frames omitted Caused by: java.sql.SQLSyntaxErrorException: Unknown column 'ship_names' in 'where clause' at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120) at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122) at com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1200) at io.debezium.relational.RelationalSnapshotChangeEventSource.resultSetForDataEvents(RelationalSnapshotChangeEventSource.java:643) at io.debezium.relational.RelationalSnapshotChangeEventSource.doCreateDataEventsForTable(RelationalSnapshotChangeEventSource.java:592) at io.debezium.relational.RelationalSnapshotChangeEventSource.lambda$createDataEventsForTableCallable$7(RelationalSnapshotChangeEventSource.java:559) ... 6 common frames omitted 2024-04-05 14:01:59,854 INFO [debezium-mysqlconnector-database-change-event-source-coordinator] io.debezium.pipeline.source.AbstractSnapshotChangeEventSource: Snapshot - Final stage 2024-04-05 14:01:59,854 WARN [debezium-mysqlconnector-database-change-event-source-coordinator] io.debezium.pipeline.source.AbstractSnapshotChangeEventSource: Snapshot was not completed successfully, it will be re-executed upon connector restart 2024-04-05 14:02:01,748 ERROR [debezium-mysqlconnector-database-change-event-source-coordinator] io.debezium.pipeline.ErrorHandler: Producer failure io.debezium.DebeziumException: java.util.concurrent.ExecutionException: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table northwind.orders failed at io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:103) at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:253) at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:237) at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:189) at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:140) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.lang.Thread.run(Thread.java:829) Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table northwind.orders failed at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122) at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191) at io.debezium.relational.RelationalSnapshotChangeEventSource.createDataEvents(RelationalSnapshotChangeEventSource.java:507) at io.debezium.relational.RelationalSnapshotChangeEventSource.doExecute(RelationalSnapshotChangeEventSource.java:170) at io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:94) ... 9 common frames omitted Caused by: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table northwind.orders failed at io.debezium.relational.RelationalSnapshotChangeEventSource.lambda$createDataEventsForTableCallable$7(RelationalSnapshotChangeEventSource.java:566) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) ... 5 common frames omitted Caused by: java.sql.SQLSyntaxErrorException: Unknown column 'ship_names' in 'where clause' at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120) at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122) at com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1200) at io.debezium.relational.RelationalSnapshotChangeEventSource.resultSetForDataEvents(RelationalSnapshotChangeEventSource.java:643) at io.debezium.relational.RelationalSnapshotChangeEventSource.doCreateDataEventsForTable(RelationalSnapshotChangeEventSource.java:592) at io.debezium.relational.RelationalSnapshotChangeEventSource.lambda$createDataEventsForTableCallable$7(RelationalSnapshotChangeEventSource.java:559) ... 6 common frames omitted 2024-04-05 14:02:07,314 WARN [debezium-mysqlconnector-database-change-event-source-coordinator] io.debezium.pipeline.ErrorHandler: Retry 1 of 6 retries will be attempted
And the error repeats with the "Retry 1 of 6..." message, further highlighting the issue.
Investigation
Upon hooking up a debugger, seems like Debezium has a flaw in the logic handling errors during the pollRecords call.
When the io.debezium.embedded.EmbeddedEngine#pollRecords tries to poll the records and fails with an exception, it calls the io.debezium.embedded.EmbeddedEngine#handleRetries method which is supposed to somehow handle the retries
What handleRetires() does in fact is: it restarts the connector task (in our case it is MySqlConnectorTask) and handles the retries of the start operation only.
The task itself has an internal object, errorHandler which is used to track the failures in the task. In fact, its only use (as far as I can see) is in two places (both pretty useless in this scenario):
- io.debezium.connector.common.BaseSourceTask#resetErrorHandlerRetriesIfNeeded - to reset the number of retries, happens after the first successful io.debezium.connector.common.BaseSourceTask#poll run; but after the exception the task will be restarted entirely, so pretty useless in my opinion
- within the catch block in the callable returned by io.debezium.pipeline.ChangeEventSourceCoordinator#start - the error is being recorded, but never used again (there are no calls to read the recorded error)
When restarting the task (which happens at each exception caught), the errorHandler is re-initialized within the task constructor itself (so even if the error was recorded, it is never checked in the outer loop). Neither is it initialized with the previous number of retries attempted when the task is restarted.
Moreover, there are two retry loops in this logic:
- (incomplete) io.debezium.pipeline.ErrorHandler#setProducerThrowable , called from io.debezium.pipeline.ChangeEventSourceCoordinator#start - the error is recorded and the number of retries is checked, but the code is never actually retried
- io.debezium.embedded.EmbeddedEngine#handleRetries called from io.debezium.embedded.EmbeddedEngine#pollRecords - only handles the startup errors in connector task; retries are recorded in a local variable, so only good for startup retries; does not count the retries from within the task itself, but probably should
Technical details
What Debezium connector do you use and what version?
Debezium 2.6.1.Final, any connector is susceptible to the failure
- duplicates
-
DBZ-7870 Default value of error retries not interpreted correctly
- Closed