I'd like to extend the functionality of the SQL Server connector to support capturing changes from all databases of a given instance by a single connector.
I did some research on Kafka Connect, the current connector implementation, and the SQL Server CDC feature and put together a document with a high-level design proposal:
Unlike the Debezium connector for MySQL, which consumes the changes from all databases via a single binlog, the SQL Server connector interacts with the CDC schema, which is enabled individually for each database.
At SugarCRM, we have a number of SQL Server instances, each hosting more than a hundred databases, whose changes we'd like to capture using the Debezium connector for SQL Server. Having to deploy a hundred connectors per instance would be suboptimal from both the resource-usage and the operational standpoints.
Unlike the MySQL connector, which receives events from the server as they are committed to the binlog, the SQL Server connector periodically calls certain stored procedures on the server and, if there are new updates, queries them on a per-table basis.
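For reference, a single polling pass can be sketched with plain JDBC (a simplified illustration only — the actual connector goes through its own connection and metadata classes, and the capture instance name `dbo_MyTable` below is a placeholder):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

public class CdcPollSketch {

    // sys.fn_cdc_get_max_lsn() returns the highest LSN available in the
    // database's change tables; comparing it to the last processed LSN tells
    // us whether there is anything new to fetch on this pass.
    static final String GET_MAX_LSN = "SELECT sys.fn_cdc_get_max_lsn()";

    // Per-table change query; each capture instance (e.g. "dbo_MyTable")
    // gets its own cdc.fn_cdc_get_all_changes_<instance> function.
    static String changesQuery(String captureInstance) {
        return "SELECT * FROM cdc.fn_cdc_get_all_changes_" + captureInstance
                + "(?, ?, N'all update old')";
    }

    // One polling pass: if the max LSN advanced, iterate over the tables.
    // LSN bookkeeping is simplified here (no fn_cdc_increment_lsn handling).
    static void pollOnce(Connection conn, byte[] lastLsn, List<String> captureInstances)
            throws SQLException {
        byte[] maxLsn;
        try (Statement s = conn.createStatement();
                ResultSet rs = s.executeQuery(GET_MAX_LSN)) {
            rs.next();
            maxLsn = rs.getBytes(1);
        }
        if (Arrays.equals(maxLsn, lastLsn)) {
            return; // nothing new since the last pass
        }
        for (String instance : captureInstances) {
            try (PreparedStatement ps = conn.prepareStatement(changesQuery(instance))) {
                ps.setBytes(1, lastLsn);
                ps.setBytes(2, maxLsn);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // emit a change event per row (omitted)
                    }
                }
            }
        }
    }
}
```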
The above "poll and iterate over tables" logic could be extended by adding a "databases" dimension on top of the existing loop.
- Instead of using a fixed database name from the configuration, the connector lists all CDC-enabled databases on the instance. The list can be filtered through the include and exclude lists, similar to the MySQL connector.
- Instead of a single task, the connector can have multiple tasks, each of which will capture the changes from a subset of the databases.
- In addition to (or instead of?) the SERVER_PARTITION_KEY, the keys of the committed offset messages will contain the database name, so Kafka Connect will commit one offset per database, not one offset per connector.
- During iteration over the databases, the connector will either change the active database on the JDBC connection using a USE <dbname> SQL statement or use a fully qualified name for all stored procedures and table names to avoid a dependency on the global connection state.
- Both the data and the metadata connections will have to execute statements against multiple databases.
- Probably, the snapshots of all databases should be done prior to streaming the changes. Given that the snapshot and the CDC phases are coordinated by the connector-agnostic ChangeEventSourceCoordinator, it's not clear how to implement this.
- The ChangeEventSourceCoordinator class, the OffsetContext, and its loader are designed for a single offset per connector/task, while in this case we need multiple offsets per task (one per database). This seems to be supported only by Kafka Connect's more generic OffsetStorageReader.
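The listing, filtering, task assignment, and per-database offset keys described above can be sketched as follows (a sketch under assumed semantics: the comma-separated include/exclude format, the round-robin assignment, and the partition key names `server`/`database` are illustrative, not the final design):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class MultiDatabaseSketch {

    // sys.databases carries an is_cdc_enabled flag, so the connector can
    // discover all CDC-enabled databases on the instance with one query.
    static final String LIST_CDC_DATABASES =
            "SELECT name FROM sys.databases WHERE is_cdc_enabled = 1";

    // Filter the discovered databases through include/exclude regex lists,
    // mirroring the MySQL connector's include/exclude list semantics.
    static List<String> filterDatabases(List<String> cdcEnabled,
            String includeList, String excludeList) {
        List<Pattern> includes = compile(includeList);
        List<Pattern> excludes = compile(excludeList);
        return cdcEnabled.stream()
                .filter(db -> includes.isEmpty()
                        || includes.stream().anyMatch(p -> p.matcher(db).matches()))
                .filter(db -> excludes.stream().noneMatch(p -> p.matcher(db).matches()))
                .collect(Collectors.toList());
    }

    private static List<Pattern> compile(String commaSeparated) {
        if (commaSeparated == null || commaSeparated.isEmpty()) {
            return List.of();
        }
        return Arrays.stream(commaSeparated.split(","))
                .map(String::trim)
                .map(Pattern::compile)
                .collect(Collectors.toList());
    }

    // Assign databases to tasks round-robin, as the connector's
    // taskConfigs(maxTasks) implementation might.
    static List<List<String>> assignToTasks(List<String> databases, int maxTasks) {
        int tasks = Math.min(maxTasks, databases.size());
        List<List<String>> assignment = new ArrayList<>();
        for (int i = 0; i < tasks; i++) {
            assignment.add(new ArrayList<>());
        }
        for (int i = 0; i < databases.size(); i++) {
            assignment.get(i % tasks).add(databases.get(i));
        }
        return assignment;
    }

    // Per-database source partition: Kafka Connect commits one offset per
    // distinct partition map, giving one offset per database rather than
    // one per connector.
    static Map<String, String> sourcePartition(String serverName, String databaseName) {
        Map<String, String> partition = new LinkedHashMap<>();
        partition.put("server", serverName);
        partition.put("database", databaseName);
        return partition;
    }
}
```

Given such per-database partition maps, a task could recover all of its offsets in a single call to Kafka Connect's OffsetStorageReader.offsets(Collection<Map<String, T>> partitions), which is the multi-partition API mentioned in the last point above.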