Uploaded image for project: 'Debezium'
  1. Debezium
  2. DBZ-7860

Snapshot retrying logic falls into infinite retry loop

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Major Major
    • 3.0.0.Beta1
    • 2.6.2.Final
    • embedded-engine
    • None
    • False
    • None
    • False

      Bug report

      Description

      When there is an exception during the data capture in the snapshot mode (for instance, table filter expression contains a non-existent table), Debezium will fall in an infinite retry loop.

      The logs contain the following:

      2024-04-05 14:01:59,826 ERROR [debezium-mysqlconnector-database-change-event-source-coordinator] io.debezium.relational.RelationalSnapshotChangeEventSource: Error during snapshot
      java.util.concurrent.ExecutionException: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table northwind.orders failed
              at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
              at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
              at io.debezium.relational.RelationalSnapshotChangeEventSource.createDataEvents(RelationalSnapshotChangeEventSource.java:507)
              at io.debezium.relational.RelationalSnapshotChangeEventSource.doExecute(RelationalSnapshotChangeEventSource.java:170)
              at io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:94)
              at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:253)
              at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:237)
              at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:189)
              at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:140)
              at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table northwind.orders failed
              at io.debezium.relational.RelationalSnapshotChangeEventSource.lambda$createDataEventsForTableCallable$7(RelationalSnapshotChangeEventSource.java:566)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              ... 5 common frames omitted
      Caused by: java.sql.SQLSyntaxErrorException: Unknown column 'ship_names' in 'where clause'
              at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
              at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
              at com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1200)
              at io.debezium.relational.RelationalSnapshotChangeEventSource.resultSetForDataEvents(RelationalSnapshotChangeEventSource.java:643)
              at io.debezium.relational.RelationalSnapshotChangeEventSource.doCreateDataEventsForTable(RelationalSnapshotChangeEventSource.java:592)
              at io.debezium.relational.RelationalSnapshotChangeEventSource.lambda$createDataEventsForTableCallable$7(RelationalSnapshotChangeEventSource.java:559)
              ... 6 common frames omitted
      2024-04-05 14:01:59,854 INFO  [debezium-mysqlconnector-database-change-event-source-coordinator] io.debezium.pipeline.source.AbstractSnapshotChangeEventSource: Snapshot - Final stage
      2024-04-05 14:01:59,854 WARN  [debezium-mysqlconnector-database-change-event-source-coordinator] io.debezium.pipeline.source.AbstractSnapshotChangeEventSource: Snapshot was not completed successfully, it will be re-executed upon connector restart
      2024-04-05 14:02:01,748 ERROR [debezium-mysqlconnector-database-change-event-source-coordinator] io.debezium.pipeline.ErrorHandler: Producer failure
      io.debezium.DebeziumException: java.util.concurrent.ExecutionException: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table northwind.orders failed
              at io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:103)
              at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:253)
              at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:237)
              at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:189)
              at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:140)
              at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
              at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
              at java.base/java.lang.Thread.run(Thread.java:829)
      Caused by: java.util.concurrent.ExecutionException: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table northwind.orders failed
              at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
              at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:191)
              at io.debezium.relational.RelationalSnapshotChangeEventSource.createDataEvents(RelationalSnapshotChangeEventSource.java:507)
              at io.debezium.relational.RelationalSnapshotChangeEventSource.doExecute(RelationalSnapshotChangeEventSource.java:170)
              at io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:94)
              ... 9 common frames omitted
      Caused by: org.apache.kafka.connect.errors.ConnectException: Snapshotting of table northwind.orders failed
              at io.debezium.relational.RelationalSnapshotChangeEventSource.lambda$createDataEventsForTableCallable$7(RelationalSnapshotChangeEventSource.java:566)
              at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
              ... 5 common frames omitted
      Caused by: java.sql.SQLSyntaxErrorException: Unknown column 'ship_names' in 'where clause'
              at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:120)
              at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
              at com.mysql.cj.jdbc.StatementImpl.executeQuery(StatementImpl.java:1200)
              at io.debezium.relational.RelationalSnapshotChangeEventSource.resultSetForDataEvents(RelationalSnapshotChangeEventSource.java:643)
              at io.debezium.relational.RelationalSnapshotChangeEventSource.doCreateDataEventsForTable(RelationalSnapshotChangeEventSource.java:592)
              at io.debezium.relational.RelationalSnapshotChangeEventSource.lambda$createDataEventsForTableCallable$7(RelationalSnapshotChangeEventSource.java:559)
              ... 6 common frames omitted
      2024-04-05 14:02:07,314 WARN  [debezium-mysqlconnector-database-change-event-source-coordinator] io.debezium.pipeline.ErrorHandler: Retry 1 of 6 retries will be attempted
       

      And the error repeats with the "Retry 1 of 6..." message, further highlighting the issue.

      Investigation

      Upon hooking up a debugger, seems like Debezium has a flaw in the logic handling errors during the pollRecords call.

       
      When the io.debezium.embedded.EmbeddedEngine#pollRecords tries to poll the records and fails with an exception, it calls the io.debezium.embedded.EmbeddedEngine#handleRetries method which is supposed to somehow handle the retries
       
      What handleRetires() does in fact is: it restarts the connector task (in our case it is MySqlConnectorTask) and handles the retries of the start operation only.
       
      The task itself has an internal object, errorHandler which is used to track the failures in the task. In fact, its only use (as far as I can see) is in two places (both pretty useless in this scenario):

      1. io.debezium.connector.common.BaseSourceTask#resetErrorHandlerRetriesIfNeeded - to reset the number of retries, happens after the first successful io.debezium.connector.common.BaseSourceTask#poll run; but after the exception the task will be restarted entirely, so pretty useless in my opinion
      2. within the catch block in the callable returned by io.debezium.pipeline.ChangeEventSourceCoordinator#start - the error is being recorded, but never used again (there are no calls to read the recorded error)

       
      When restarting the task (which happens at each exception caught), the errorHandler is re-initialized within the task constructor itself (so even if the error was recorded, it is never checked in the outer loop). Neither is it initialized with the previous number of retries attempted when the task is restarted.
       
      Moreover, there are two retry loops in this logic:

      1. (incomplete) io.debezium.pipeline.ErrorHandler#setProducerThrowable , called from io.debezium.pipeline.ChangeEventSourceCoordinator#start - the error is recorded and the number of retries is checked, but the code is never actually retried
      2. io.debezium.embedded.EmbeddedEngine#handleRetries called from io.debezium.embedded.EmbeddedEngine#pollRecords - only handles the startup errors in connector task; retries are recorded in a local variable, so only good for startup retries; does not count the retries from within the task itself, but probably should

      Technical details

      What Debezium connector do you use and what version?

      Debezium 2.6.1.Final, any connector is susceptible to the failure

        1. Screenshot 2024-07-01 at 2.17.08 PM.png
          1.44 MB
          Artem Shubovych
        2. Screenshot 2024-07-01 at 2.17.30 PM.png
          1.44 MB
          Artem Shubovych
        3. Screenshot 2024-07-01 at 2.24.18 PM.png
          1.54 MB
          Artem Shubovych

              vjuranek@redhat.com Vojtech Juranek
              artem.shubovych Artem Shubovych (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Created:
                Updated:
                Resolved: