DBZ-685: Debezium fails to snapshot large databases

Steps to Reproduce

      Run Debezium against a ~100 GB database (the whitelisted tables are ~100 GB in size; only the size of the snapshotted tables is relevant).
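
      For reference, a minimal sketch of the kind of connector configuration behind this setup, written as a Java Properties object mirroring the usual .properties registration. The hostname, port, credentials, and server name are assumed placeholders; only the connector name and the whitelisted tables come from the logs below.

      import java.util.Properties;

      public class ConnectorConfigSketch {
          public static void main(String[] args) {
              Properties props = new Properties();
              // Connector name as it appears in the WorkerSourceTask log lines below.
              props.setProperty("name", "Live-Source-Debez-MySQL-ClusterA-Worker-Connector");
              props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
              props.setProperty("database.hostname", "mysql.example.com"); // assumed
              props.setProperty("database.port", "3306");                  // assumed
              props.setProperty("database.user", "debezium");              // assumed
              props.setProperty("database.password", "secret");            // assumed
              props.setProperty("database.server.name", "ClusterA");       // assumed
              // The whitelisted tables total ~100 GB, which is what makes the snapshot long.
              props.setProperty("table.whitelist", "mydb.mytable,mydb.myothertable");
              props.list(System.out);
          }
      }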

    Description

      Logs for the first few hours are all fine:

      [2018-04-17 08:47:54,330] INFO Step 8: - 4730000 of <unknown> rows scanned from table 'mydb.myothertable' after 00:32:52.806 (io.debezium.connector.mysql.SnapshotReader)
      [2018-04-17 08:47:54,376] INFO Step 8: - 4740000 of <unknown> rows scanned from table 'mydb.myothertable' after 00:32:52.852 (io.debezium.connector.mysql.SnapshotReader)
      [2018-04-17 08:47:56,948] INFO WorkerSourceTask{id=Live-Source-Debez-MySQL-ClusterA-Worker-Connector-0} Finished commitOffsets successfully in 20359 ms (org.apache.kafka.connect.runtime.WorkerSourceTask)
      

      At one point, however, all flushes start to fail, and the scanning seems to stop during that time:

      [2018-04-17 08:48:11,676] INFO Step 8: - 70000 of <unknown> rows scanned from table 'mydb.mytable' after 00:00:12.501 (io.debezium.connector.mysql.SnapshotReader)
      [2018-04-17 08:48:11,723] INFO Step 8: - 80000 of <unknown> rows scanned from table 'mydb.mytable' after 00:00:12.454 (io.debezium.connector.mysql.SnapshotReader)
      ...
      [2018-04-17 08:48:26,958] INFO WorkerSourceTask{id=workerid-0} flushing 118670 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
      [2018-04-17 08:48:26,958] INFO WorkerSourceTask{id=workerid-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask)
      [2018-04-17 08:48:51,958] ERROR WorkerSourceTask{id=workerid-0} Failed to flush, timed out while waiting for producer to flush outstanding 118670 messages (org.apache.kafka.connect.runtime.WorkerSourceTask)
      [2018-04-17 08:48:51,959] ERROR WorkerSourceTask{id=workerid-0} Failed to commit offsets (org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter)
      

This happens 20 times over the next 10 minutes (one attempt every 30 s, per the flush interval). Note that the number of outstanding messages (118670 in this example) is exactly the same across all of those flush attempts.
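
      For context, the cadence and the timeout of these flush attempts are governed by two standard Kafka Connect worker settings, offset.flush.interval.ms and offset.flush.timeout.ms. The values below are inferred from the timing in the logs (an attempt every 30 s, each failing roughly 25 s later) and are assumptions, not values read from the actual worker configuration:

      import java.util.Properties;

      public class WorkerFlushSettingsSketch {
          public static void main(String[] args) {
              Properties worker = new Properties();
              // SourceTaskOffsetCommitter schedules an offset commit at this interval.
              worker.setProperty("offset.flush.interval.ms", "30000"); // inferred from the 30 s cadence
              // WorkerSourceTask stops waiting for the producer after this long,
              // producing the "Failed to flush, timed out" ERROR above.
              worker.setProperty("offset.flush.timeout.ms", "25000");  // inferred; the stock default is 5000
              worker.list(System.out);
          }
      }
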
After that, two warnings are shown. They differ only in the topic: "myothertable" is the table that was scanned before the current one. This makes some sense, since the number of pending messages (118670) exceeds what has been scanned from the current table so far, so some of the pending messages must still belong to the previous table. Also note that the network connection during this time is completely fine; requests in both directions go through.

      [2018-04-17 08:58:11,349] WARN [Producer clientId=producer-4] Got error produce response with correlation id 2895200 on topic-partition mytable-0, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
      [2018-04-17 08:58:11,349] WARN [Producer clientId=producer-4] Got error produce response with correlation id 2895200 on topic-partition myothertable-0, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
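
      The "2147483646 attempts left" figure is Integer.MAX_VALUE minus one, i.e. the producer is configured with effectively infinite retries, so a NETWORK_EXCEPTION never fails the task outright; the send just keeps retrying while the offset flushes keep timing out. A small sketch of that arithmetic and the producer settings it implies (the settings are assumed Connect-era defaults for source-task producers, not confirmed from this deployment):

      public class ProducerRetrySketch {
          public static void main(String[] args) {
              // The WARN above reports "2147483646 attempts left":
              // that is Integer.MAX_VALUE - 1, i.e. retries was Integer.MAX_VALUE
              // and exactly one attempt has been consumed.
              System.out.println(Integer.MAX_VALUE - 1); // prints 2147483646

              java.util.Properties producer = new java.util.Properties();
              producer.setProperty("retries", String.valueOf(Integer.MAX_VALUE)); // assumed
              producer.setProperty("max.in.flight.requests.per.connection", "1"); // assumed
              producer.setProperty("acks", "all");                                // assumed
              producer.list(System.out);
          }
      }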
      

      Then, polling resumes for a few seconds before a commit is attempted:

      [2018-04-17 08:58:13,047] INFO Step 8: - 90000 of <unknown> rows scanned from table 'mydb.mytable' after 00:10:13.825 (io.debezium.connector.mysql.SnapshotReader)
      [2018-04-17 08:58:13,095] INFO Step 8: - 100000 of <unknown> rows scanned from table 'mydb.mytable' after 00:10:13.873 (io.debezium.connector.mysql.SnapshotReader)
      [2018-04-17 08:58:15,086] INFO Step 8: - 110000 of <unknown> rows scanned from table 'mydb.mytable' after 00:10:15.864 (io.debezium.connector.mysql.SnapshotReader)
      [2018-04-17 08:58:15,105] INFO Step 8: committing transaction (io.debezium.connector.mysql.SnapshotReader)
      

Then the crash happens. My logging somehow garbles the output, so I'm not entirely sure of the correct ordering of the stack traces, but the messages involved are:

      org.apache.kafka.connect.errors.ConnectException: No operations allowed after connection closed. Error code: 0; SQLSTATE: 08003.
      Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed.
      Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
      
      ERROR Failed due to error: Aborting snapshot due to error when last running 'COMMIT': No operations allowed after connection closed. (io.debezium.connector.mysql.SnapshotReader)
      Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
      The last packet successfully received from the server was 11 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.
      Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed.
      org.apache.kafka.connect.errors.ConnectException: No operations allowed after connection closed. Error code: 0; SQLSTATE: 08003.
      [2018-04-17 23:58:30,612] INFO [Producer clientId=producer-4] Closing the Kafka producer with timeoutMillis = 30000 ms. (org.apache.kafka.clients.producer.KafkaProducer)
      [2018-04-17 23:58:30,612] ERROR WorkerSourceTask{id=workerid-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
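
      The root symptom is that the JDBC connection underlying the snapshot is already dead by the time the final COMMIT runs. Below is a minimal sketch of the failure mode, assuming a plain JDBC connection; this is illustrative only, not Debezium's actual SnapshotReader code, and the URL and credentials are placeholders. If the connection sits idle past the MySQL server's timeout while the producer is blocked, the server drops it, and any subsequent commit() fails with the SQLSTATE 08003 error seen above.

      import java.sql.Connection;
      import java.sql.DriverManager;
      import java.sql.SQLException;

      public class SnapshotCommitSketch {
          public static void main(String[] args) throws SQLException {
              Connection conn = DriverManager.getConnection(
                      "jdbc:mysql://mysql.example.com:3306/mydb", "debezium", "secret");
              conn.setAutoCommit(false);

              // ... long-running snapshot scan; if the producer stalls for longer
              // than the server's idle timeout, MySQL closes the connection
              // underneath us ...

              // Checking validity before the final COMMIT would surface the
              // problem with a clear error instead of the generic 08003 failure:
              if (!conn.isValid(5 /* seconds */)) {
                  throw new SQLException("Snapshot connection lost before COMMIT "
                          + "(server likely closed an idle connection)");
              }
              conn.commit();
          }
      }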
      

People

            Assignee: jpechane Jiri Pechanec
            Reporter: mrtrustworthy Felix Eggert (Inactive)