The issue is reproducible with the MySQL connector and could, in theory, affect any connector that recovers its schema from a database history topic.
If the database schema includes many tables and has a long history, the history topic can be large (in our case, ~4 million messages and ~4 GB of data in total). When the connector restarts, schema recovery from such a topic can take significant time (30-60 minutes).
In terms of how the database connection is used, the MySQL task lifecycle is as follows (see MySqlConnectorTask):
- Connect to the database (presumably to validate the connection parameters).
- Recover database schema from the topic (may take a while).
- Start consuming events from the binlog.
During the schema recovery, the connection sits idle and can time out if the recovery takes longer than either of the following:
- The MySQL session wait_timeout. This can be worked around by increasing the timeout on the server or on the client.
- The idle timeout of the underlying TCP connection, imposed by the network between the client and the server, for instance the 350-second idle timeout of an AWS VPC NAT gateway.
If the connection times out, the connector fails as soon as the schema recovery completes, with an error message like the following:
io.debezium.DebeziumException: Unexpected error while connecting to MySQL and looking for binary logs:
Caused by: com.mysql.cj.jdbc.exceptions.CommunicationsException: The last packet successfully received from the server was 2,053,326 milliseconds ago. The last packet sent successfully to the server was 2,053,338 milliseconds ago. is longer than the server configured value of 'wait_timeout'. You should consider either expiring and/or testing connection validity before use in your application, increasing the server configured values for client timeouts, or using the Connector/J connection property 'autoReconnect=true' to avoid this problem.
Note that the 2,053,326 milliseconds above roughly correspond to the 34 minutes the schema recovery took. The statement "is longer than the server configured value of 'wait_timeout'" is produced by the underlying JDBC driver and is not necessarily relevant, since the connection may have been dropped by a network-level timeout instead. The driver's suggestion to use autoReconnect=true does not help either, since it requires the connector to handle the reconnection, which is currently not implemented.
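To make the arithmetic explicit, here is a minimal sketch that compares the idle time from the error message with the two timeouts discussed above. The wait_timeout value is an assumption (MySQL's default of 28800 seconds); the value configured on our server may differ. The class and method names are illustrative only.

```java
import java.time.Duration;

public class IdleTimeoutCheck {
    // Idle time reported by the JDBC driver in the error message above.
    static final Duration IDLE = Duration.ofMillis(2_053_326);
    // Idle timeout of the AWS VPC NAT gateway mentioned above.
    static final Duration NAT_IDLE_TIMEOUT = Duration.ofSeconds(350);
    // Assumption: the server uses MySQL's default wait_timeout (28800 s).
    static final Duration ASSUMED_WAIT_TIMEOUT = Duration.ofSeconds(28_800);

    // True if the connection was idle for longer than the given timeout.
    static boolean exceeds(Duration idle, Duration timeout) {
        return idle.compareTo(timeout) > 0;
    }

    public static void main(String[] args) {
        System.out.println("idle minutes: " + IDLE.toMinutes());                            // 34
        System.out.println("exceeds NAT timeout: " + exceeds(IDLE, NAT_IDLE_TIMEOUT));      // true
        System.out.println("exceeds wait_timeout: " + exceeds(IDLE, ASSUMED_WAIT_TIMEOUT)); // false
    }
}
```

Under that assumption, the connection would have been dropped by the network long before wait_timeout expired, which is why the driver's reference to wait_timeout can be misleading.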
As a workaround, the logic implemented in https://github.com/sugarcrm/debezium/pull/68 could be used:
- Disconnect before starting schema recovery to avoid the timeout.
- Reconnect after the recovery is completed.
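The two steps above can be sketched roughly as follows. This is not the code from the pull request and does not use Debezium's actual classes: DbConnection, RecordingConnection and start are hypothetical names that only illustrate the order of operations.

```java
import java.util.ArrayList;
import java.util.List;

public class ReconnectAroundRecovery {
    /** Hypothetical stand-in for the connector's database connection. */
    interface DbConnection {
        void connect();
        void disconnect();
        boolean isConnected();
    }

    /** Task startup with the connection released while the schema is recovered. */
    static void start(DbConnection connection, Runnable recoverSchema, Runnable startStreaming) {
        connection.connect();     // validate the connection parameters
        connection.disconnect();  // do not hold an idle connection during recovery
        recoverSchema.run();      // may take a long time
        connection.connect();     // fresh connection for the binlog reader
        startStreaming.run();
    }

    /** In-memory fake that records the calls made on it. */
    static class RecordingConnection implements DbConnection {
        private final List<String> events;
        private boolean connected;

        RecordingConnection(List<String> events) {
            this.events = events;
        }

        @Override
        public void connect() {
            connected = true;
            events.add("connect");
        }

        @Override
        public void disconnect() {
            connected = false;
            events.add("disconnect");
        }

        @Override
        public boolean isConnected() {
            return connected;
        }
    }

    /** Runs the startup sequence against the fake and returns the recorded events. */
    static List<String> demo() {
        List<String> events = new ArrayList<>();
        RecordingConnection connection = new RecordingConnection(events);
        start(connection,
                () -> events.add("recover(connected=" + connection.isConnected() + ")"),
                () -> events.add("stream(connected=" + connection.isConnected() + ")"));
        return events;
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

With this ordering, no connection is open while the history topic is being replayed, so neither wait_timeout nor a network idle timeout can expire during the recovery.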