The existing error handler retries retriable connection errors that occur while Kafka Connect polls the task, but the same logic doesn't apply to task start. As a result, if the underlying connection issue isn't fixed within retriable.restart.connector.wait.ms, the task never recovers.
See an excerpt from the worker logs for the details of what is happening:
- The streaming change source catches an exception from the database. The error handler parses the error message, recognizes it as a retriable error, converts it to a RetriableException, and stores it in the queue. The task then polls the queue and throws the exception.
- BaseSourceTask catches the retriable exception and restarts the task.
- The task restarts, attempts to connect to the database, and fails because the server is still shutting down. This time, the exception is thrown by SqlServerConnectorTask#start, not SqlServerConnectorTask#poll, so it doesn't trigger the retry logic.
- Kafka Connect kills the task.
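To illustrate the gap in the sequence above, here is a minimal sketch of what applying the same retry idea to task start could look like. It is not the actual Debezium code or a proposed patch: the Task interface, the startWithRetry helper, and the local RetriableException class are hypothetical stand-ins (the last one for Kafka Connect's RetriableException) so the example compiles without any dependencies.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class StartRetrySketch {

    /** Local stand-in for Kafka Connect's RetriableException (assumption, keeps the sketch dependency-free). */
    static class RetriableException extends RuntimeException {
        RetriableException(String message) {
            super(message);
        }
    }

    /** Hypothetical minimal task interface; only start() matters for this sketch. */
    interface Task {
        void start();
    }

    /**
     * Calls task.start() up to maxAttempts times, sleeping waitMs between attempts.
     * Returns the number of attempts used. If every attempt fails with a retriable
     * error, the last one is rethrown so the caller can still fail the task.
     */
    static int startWithRetry(Task task, int maxAttempts, long waitMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        for (int attempt = 1; ; attempt++) {
            try {
                task.start();
                return attempt;
            } catch (RetriableException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts: surface the error
                }
                try {
                    Thread.sleep(waitMs); // wait before the next attempt
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e; // stop retrying if the thread is interrupted
                }
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Simulates a database that is still shutting down during the first two attempts.
        Task flaky = () -> {
            if (calls.incrementAndGet() < 3) {
                throw new RetriableException("server is still shutting down");
            }
        };
        int attempts = startWithRetry(flaky, 5, 10L);
        System.out.println("started after " + attempts + " attempts");
    }
}
```

In this sketch a start failure is handled the same way as a poll failure: the task keeps retrying until the database is reachable again or the attempt budget runs out, instead of being killed on the first failed start.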
The same issue is reproducible with the example from debezium-tutorials by stopping the SQL Server instance while the connector is up.