Loading...

Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 1.0.0.Beta1
Affects Version/s: 0.7.5
Component/s: mysql-connector
Labels:
- qe-investigate
Environment:

Hide

Relevant worker settings:
Xmx10g (Note: before setting that, the error message would also include an OOM error. The worker machine has a total of 16g RAM)
offset.flush.interval.ms: "30000"
offset.flush.timeout.ms: "25000"

Relevant connect settings:
"database.history.kafka.recovery.poll.interval.ms": "1000",
"database.history.kafka.recovery.attempts": "120",
"max.batch.size": "16384",
"max.queue.size": "65536",
"connect.timeout.ms": "120000"

Database: MySQL 5.6
Kafka 1.0.0

Show
Relevant worker settings: Xmx10g (Note: before setting that, the error message would also include an OOM error. The worker machine has a total of 16g RAM) offset.flush.interval.ms: "30000" offset.flush.timeout.ms: "25000" Relevant connect settings: "database.history.kafka.recovery.poll.interval.ms": "1000", "database.history.kafka.recovery.attempts": "120", "max.batch.size": "16384", "max.queue.size": "65536", "connect.timeout.ms": "120000" Database: MySQL 5.6 Kafka 1.0.0

Git Pull Request:
https://github.com/debezium/debezium/pull/1059
Steps to Reproduce:

Hide

Run debezium on a ~100G database (whitelisted tables are 100G in size - only snapshotted database size is relevant)

Show
Run debezium on a ~100G database (whitelisted tables are 100G in size - only snapshotted database size is relevant)

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Logs for the first few hours are all fine:

[2018-04-17 08:47:54,330] INFO Step 8: - 4730000 of <unknown> rows scanned from table 'mydb.myothertable' after 00:32:52.806 (io.debezium.connector.mysql.SnapshotReader)
[2018-04-17 08:47:54,376] INFO Step 8: - 4740000 of <unknown> rows scanned from table 'mydb.myothertable' after 00:32:52.852 (io.debezium.connector.mysql.SnapshotReader)
[2018-04-17 08:47:56,948] INFO WorkerSourceTask{id=Live-Source-Debez-MySQL-ClusterA-Worker-Connector-0} Finished commitOffsets successfully in 20359 ms (org.apache.kafka.connect.runtime.WorkerSourceTask)

At one point, however, all flushes start to fail, and the scanning seems to stop during that time:

[2018-04-17 08:48:11,676] INFO Step 8: - 70000 of <unknown> rows scanned from table 'mydb.mytable' after 00:00:12.501 (io.debezium.connector.mysql.SnapshotReader)
[2018-04-17 08:48:11,723] INFO Step 8: - 80000 of <unknown> rows scanned from table 'mydb.mytable' after 00:00:12.454 (io.debezium.connector.mysql.SnapshotReader)
...
[2018-04-17 08:48:26,958] INFO WorkerSourceTask{id=workerid-0} flushing 118670 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-04-17 08:48:26,958] INFO WorkerSourceTask{id=workerid-0} Committing offsets (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-04-17 08:48:51,958] ERROR WorkerSourceTask{id=workerid-0} Failed to flush, timed out while waiting for producer to flush outstanding 118670 messages (org.apache.kafka.connect.runtime.WorkerSourceTask)
[2018-04-17 08:48:51,959] ERROR WorkerSourceTask{id=workerid-0} Failed to commit offsets (org.apache.kafka.connect.runtime.SourceTaskOffsetCommitter)

This will happen 20 times over the next 10 minutes (this is because of the 30s flush interval). Note that the AMOUNT of flushed messages (118670 in this example) is exactly the same during all of those flush attempts.
After that, two warnings are shown. They differ in the topic, with the "myothertable" being the table that was scanned before the current one. This somehow makes sense, since the amount of pending messages (118670) is more than we've scanned from the current table so far, so some of the pending messages are still from the previous table. Also note that the network connection during that time is completely fine - requests in both directions go through.

[2018-04-17 08:58:11,349] WARN [Producer clientId=producer-4] Got error produce response with correlation id 2895200 on topic-partition mytable-0, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)
[2018-04-17 08:58:11,349] WARN [Producer clientId=producer-4] Got error produce response with correlation id 2895200 on topic-partition myothertable-0, retrying (2147483646 attempts left). Error: NETWORK_EXCEPTION (org.apache.kafka.clients.producer.internals.Sender)

Then, polling resumes for a few seconds before a commit is attempted:

[2018-04-17 08:58:13,047] INFO Step 8: - 90000 of <unknown> rows scanned from table 'mydb.mytable' after 00:10:13.825 (io.debezium.connector.mysql.SnapshotReader)
[2018-04-17 08:58:13,095] INFO Step 8: - 100000 of <unknown> rows scanned from table 'mydb.mytable' after 00:10:13.873 (io.debezium.connector.mysql.SnapshotReader)
[2018-04-17 08:58:15,086] INFO Step 8: - 110000 of <unknown> rows scanned from table 'mydb.mytable' after 00:10:15.864 (io.debezium.connector.mysql.SnapshotReader)
[2018-04-17 08:58:15,105] INFO Step 8: committing transaction (io.debezium.connector.mysql.SnapshotReader)

Then, the crash happens. My logging somehow garbles the output, so I'm not entirely sure of the correct ordering of tracebacks, but the messages involved are:

org.apache.kafka.connect.errors.ConnectException: No operations allowed after connection closed. Error code: 0; SQLSTATE: 08003.
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed.
Caused by: com.mysql.jdbc.exceptions.jdbc4.CommunicationsException: Communications link failure

ERROR Failed due to error: Aborting snapshot due to error when last running 'COMMIT': No operations allowed after connection closed. (io.debezium.connector.mysql.SnapshotReader)
Caused by: java.io.EOFException: Can not read response from server. Expected to read 4 bytes, read 0 bytes before connection was unexpectedly lost.
The last packet successfully received from the server was 11 milliseconds ago.  The last packet sent successfully to the server was 0 milliseconds ago.
Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLNonTransientConnectionException: No operations allowed after connection closed.
org.apache.kafka.connect.errors.ConnectException: No operations allowed after connection closed. Error code: 0; SQLSTATE: 08003.
[2018-04-17 23:58:30,612] INFO [Producer clientId=producer-4] Closing the Kafka producer with timeoutMillis = 30000 ms. (org.apache.kafka.clients.producer.KafkaProducer)
[2018-04-17 23:58:30,612] ERROR WorkerSourceTask{id=workerid-0} Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask)

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

screenshot-1.png
17 kB
2019/09/30 11:06 AM

Details

Description

Attachments

Attachments

Easy Agile Planning Poker

Activity

People

Dates