
      Run Debezium on a ~100 GB database (the whitelisted tables are ~100 GB in total; only the snapshotted size is relevant).
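
      For context, registering this kind of connector would look roughly as follows. This is a hypothetical sketch: host names, credentials, topics, and values are placeholders, not the configuration from this report; only the connector name echoes the logs below, and table.whitelist is the property name the MySQL connector used at the time.

          import json
          import requests  # assumes the 'requests' package is available

          # Hypothetical connector registration against the Kafka Connect REST API.
          # All values are placeholders.
          connector = {
              "name": "Live-Source-Debez-MySQL-ClusterA-Worker-Connector",
              "config": {
                  "connector.class": "io.debezium.connector.mysql.MySqlConnector",
                  "database.hostname": "mysql.example.com",
                  "database.port": "3306",
                  "database.user": "debezium",
                  "database.password": "secret",
                  "database.server.id": "184054",
                  "database.server.name": "clustera",
                  # Only the whitelisted tables (~100 GB here) are snapshotted.
                  "table.whitelist": "mydb.mytable,mydb.myothertable",
                  "database.history.kafka.bootstrap.servers": "kafka:9092",
                  "database.history.kafka.topic": "dbhistory.clustera",
              },
          }

          resp = requests.post("http://connect:8083/connectors",
                               headers={"Content-Type": "application/json"},
                               data=json.dumps(connector))
          resp.raise_for_status()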


      Logs for the first few hours are all fine:

      [2018-04-17 08:47:54,330] INFO Step 8: - 4730000 of <unknown> rows scanned from table 'mydb.myothertable' after 00:32:52.806 (io.debezium.connector.mysql.SnapshotReader)
      [2018-04-17 08:47:54,376] INFO Step 8: - 4740000 of <unknown> rows scanned from table 'mydb.myothertable' after 00:32:52.852 (io.debezium.connector.mysql.SnapshotReader)
      [2018-04-17 08:47:56,948] INFO WorkerSourceTask{id=Live-Source-Debez-MySQL-ClusterA-Worker-Connector-0} Finished commitOffsets successfully in 20359 ms (org.apache.kafka.connect.runtime.WorkerSourceTask)
      

      At one point, however, all flushes start to fail, and the scanning seems to stop during that time:

      [2018-04-17 08:48:11,676] INFO Step 8: - 70000 of <unknown> rows scanned from table 'mydb.mytable' after 00:00:12.501 (io.debezium.connector.mysql.SnapshotReader)
      [2018-04-17 08:48:11,723] INFO Step 8: - 80000 of <unknown> rows scanned from table 'mydb.mytable' after 00:00:12.454 (io.debezium.connector.mysql.SnapshotReader)
      ...
      [2018-04-17 08:48:26,958] INFO WorkerSourceTask{id=workerid-0} flushing 118670 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
      [2018-04-17 08:48:26,958] INFO WorkerSourceTask{id=workerid-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask)
      [2018-04-17 08:48:51,958] ERROR WorkerSourceTask{id=workerid-0} Failed to flush, timed out while waiting for producer to flush outstanding 118670 messages (org.apache.kafka.connect.runtime.WorkerSourceTask)
      [2018-04-17 08:48:51,959] ERROR WorkerSourceTask{id=workerid-0} Failed to commit offsets (org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter)
      

      This repeats 20 times over the next 10 minutes (one attempt per 30 s flush interval). Note that the number of outstanding messages (118670 in this example) is exactly the same in every one of those flush attempts.
      After that, two warnings are logged. They differ only in the topic, "myothertable" being the table that was scanned before the current one. This makes sense, since the number of pending messages (118670) is larger than what has been scanned from the current table so far, so some of the pending messages still belong to the previous table. Also note that the network connection during that time is completely fine; requests in both directions go through.

      [2018-04-17 08:58:11,349] WARN [Producer clientId=producer-4] Got error produce response with correlation id 2895200 on topic-partition mytable-0, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
      [2018-04-17 08:58:11,349] WARN [Producer clientId=producer-4] Got error produce response with correlation id 2895200 on topic-partition myothertable-0, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
      

      Then, polling resumes for a few seconds before a commit is attempted:

      [2018-04-17 08:58:13,047] INFO Step 8: - 90000 of <unknown> rows scanned from table 'mydb.mytable' after 00:10:13.825 (io.debezium.connector.mysql.SnapshotReader)
      [2018-04-17 08:58:13,095] INFO Step 8: - 100000 of <unknown> rows scanned from table 'mydb.mytable' after 00:10:13.873 (io.debezium.connector.mysql.SnapshotReader)
      [2018-04-17 08:58:15,086] INFO Step 8: - 110000 of <unknown> rows scanned from table 'mydb.mytable' after 00:10:15.864 (io.debezium.connector.mysql.SnapshotReader)
      [2018-04-17 08:58:15,105] INFO Step 8: committing transaction (io.debezium.connector.mysql.SnapshotReader)
      

      Then the crash happens. My logging garbles the output somewhat, so I'm not entirely sure of the exact ordering of the stack traces, but the messages involved are:

      org.apache.kafka.connect.errors.ConnectException: No operations allowed after connection closed. Error code: 0; SQLSTATE: 08003.
      Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed.
      Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure
      
      ERROR Failed due to error: Aborting snapshot due to error when last running 'COMMIT': No operations allowed after connection closed. (io.debezium.connector.mysql.SnapshotReader)
      Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
      The last packet successfully received from the server was 11 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.
      Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed.
      org.apache.kafka.connect.errors.ConnectException: No operations allowed after connection closed. Error code: 0; SQLSTATE: 08003.
      [2018-04-17 23:58:30,612] INFO [Producer clientId=producer-4] Closing the Kafka producer with timeoutMillis = 30000 ms. (org.apache.kafka.clients.producer.KafkaProducer)
      [2018-04-17 23:58:30,612] ERROR WorkerSourceTask{id=workerid-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)
      

            [DBZ-685] Debezium fails to snapshot large databases

            Jiri Pechanec added a comment - Released

            Jiri Pechanec added a comment - eric.franckx@ngahr.com Could we move the discussion elsewhere if we continue with it? Yes, you should have a watchdog process that checks the status of the connector and tasks via the REST API and restarts the connector if anything has failed.
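
            A minimal sketch of such a watchdog follows; the Connect REST endpoint, connector name, and polling interval are placeholders, while the status and restart endpoints are the standard Kafka Connect REST API:

                import time
                import requests  # assumes the 'requests' package is available

                CONNECT = "http://localhost:8083"      # Kafka Connect REST endpoint (placeholder)
                CONNECTOR = "my-debezium-connector"    # connector name (placeholder)

                while True:
                    status = requests.get(f"{CONNECT}/connectors/{CONNECTOR}/status").json()
                    # Restart the whole connector if it has failed.
                    if status["connector"]["state"] == "FAILED":
                        requests.post(f"{CONNECT}/connectors/{CONNECTOR}/restart")
                    # Restart any individual task that has failed.
                    for task in status["tasks"]:
                        if task["state"] == "FAILED":
                            requests.post(f"{CONNECT}/connectors/{CONNECTOR}/tasks/{task['id']}/restart")
                    time.sleep(60)  # polling interval in seconds (placeholder)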


            Eric FRANCKX (Inactive) added a comment -

            Hi,
            regarding:
            "Generally I'd avoid notifications but recommend to restart the connector automatically when it is down."

            So we should have a daemon/cron job that checks the status every xx minutes and, if the connector is down, restarts it automatically?

            Regards,
            Eric


            Eric FRANCKX (Inactive) added a comment -

            Hi,

            • We use Debezium 1.0 RC2.
            • We don't use containers.
            • I will check the MySQL log.
            • In my case it happens during standard operation ... the snapshot is already done, and we are waiting for new data ...

            Regards,
            Eric


            Jiri Pechanec added a comment - - edited

            eric.franckx@ngahr.com Hi, which version of Debezium are you on? Do the lost connections correspond to the timeout you have configured?

            Regarding monitoring and alerting, are you on Docker or do you run it as a normal app? Generally I'd avoid notifications and instead recommend restarting the connector automatically when it is down.

            Also, is there anything in the database server log when the connection is lost?

            Also, are we talking about snapshotting or streaming?


            Eric FRANCKX (Inactive) added a comment -

            Hi,

            • We seem to lose the connection to our MySQL database every day.
            • The timeout is 1600 because we want any app that is doing nothing to be forced to:
              - be killed and not waste time
              - respawn its connection

            Questions:
            What will happen to the connector if nothing happens for more than 1600 seconds (no data changes in the MySQL source tables)?
            Will the connector session stay connected to MySQL or be closed?

            As we can see in the doc:

            What is the best way to monitor and alert if a connector's status is not RUNNING? Use a Linux daemon that checks every xx minutes and, based on the status, sends a mail/notification?

            Regards,
            Eric


            Gunnar Morling added a comment - Seems we should add something in the docs/FAQ on increasing the default timeout settings.

            Joel Salmeron Viver (Inactive) added a comment -

            We have also run into this; after about 8 hours we had a MySQL session termination, due to the MySQL server settings

            wait_timeout
            interactive_timeout

            both of which coincidentally default to 28800 seconds, which is 8 hours.

            Try increasing these values on the MySQL source instance.
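
            A quick sketch of checking and raising those settings; this assumes the pymysql package, a user with the SUPER or SYSTEM_VARIABLES_ADMIN privilege, placeholder connection details, and 86400 is only an illustrative value:

                import pymysql  # assumes the 'pymysql' package is available

                conn = pymysql.connect(host="mysql.example.com", user="admin", password="secret")
                with conn.cursor() as cur:
                    # Current session-termination timeouts (both default to 28800 s = 8 h).
                    cur.execute("SHOW GLOBAL VARIABLES LIKE 'wait_timeout'")
                    print(cur.fetchone())
                    cur.execute("SHOW GLOBAL VARIABLES LIKE 'interactive_timeout'")
                    print(cur.fetchone())
                    # Raise them so a long-running snapshot is not cut off.
                    # Illustrative value; SET GLOBAL only affects sessions opened afterwards,
                    # so the connector needs a fresh connection to pick it up.
                    cur.execute("SET GLOBAL wait_timeout = 86400")
                    cur.execute("SET GLOBAL interactive_timeout = 86400")
                conn.close()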


            Jiri Pechanec added a comment - mrtrustworthy Is this still valid?

            Felix Eggert (Inactive) added a comment - - edited

            Acks were left at the default setting (1 IIRC).
            It is possible, but unlikely: the broker machines themselves behaved normally, and no logs on any of the brokers indicated lost connections. The network connection between the Connect machine and the brokers also seemed normal at those times (tested via ping and telnet).

