DBZ-2975: Support for Multiple Databases per SQL Server Connector

    • Type: Enhancement
    • Resolution: Done
    • Priority: Major
    • Fix Version: 1.8.0.Beta1
    • Affects Version: 1.4.0.Final
    • Component: sqlserver-connector

      I'd like to extend the functionality of the SQL Server connector to support capturing changes from all databases of a given instance by a single connector.

      I did some research on Kafka Connect, the current connector implementation, and the SQL Server CDC feature and put together a document with a high-level design proposal:

      Background

      Unlike the Debezium connector for MySQL, which consumes the changes from all databases via a single binlog, the SQL Server connector interacts with the CDC schema, which is deployed individually in each database.
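
      To illustrate what "deployed individually" means in practice: CDC has to be enabled inside each database, and then for each captured table, using the SQL Server system procedures sys.sp_cdc_enable_db and sys.sp_cdc_enable_table. A minimal JDBC sketch (the connection details, database names, and table name below are placeholders for illustration only):

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.SQLException;
        import java.sql.Statement;
        import java.util.List;

        public class EnableCdcSketch {
            // CDC is deployed per database: sys.sp_cdc_enable_db must run inside each
            // database, and sys.sp_cdc_enable_table once per table to be captured.
            // Database and table names below are placeholders.
            public static void main(String[] args) throws SQLException {
                try (Connection connection = DriverManager.getConnection(
                        "jdbc:sqlserver://localhost:1433;encrypt=false", "sa", "password");
                     Statement statement = connection.createStatement()) {
                    for (String database : List.of("tenant_db_001", "tenant_db_002")) {
                        statement.execute("USE [" + database + "]");
                        statement.execute("EXEC sys.sp_cdc_enable_db");
                        statement.execute("EXEC sys.sp_cdc_enable_table "
                                + "@source_schema = N'dbo', @source_name = N'accounts', @role_name = NULL");
                    }
                }
            }
        }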

      At SugarCRM, we have a number of SQL Server instances hosting more than a hundred databases each, whose changes we'd like to capture using the Debezium connector for SQL Server. Having to deploy a hundred connectors per instance may be suboptimal from both the resourcing and the operations standpoints.

      Unlike the MySQL connector, which receives events from the server as they get committed to the binlog, the SQL Server connector periodically calls certain stored procedures on the server and, if there are new updates, queries them on a per-table basis.
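
      Roughly, and only as a simplified sketch rather than the actual connector code, that per-table polling boils down to the following, using the standard CDC functions sys.fn_cdc_get_max_lsn and cdc.fn_cdc_get_all_changes_<capture instance> (the capture instance names passed in would be hypothetical, e.g. dbo_accounts):

        import java.sql.Connection;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;
        import java.sql.SQLException;
        import java.sql.Statement;
        import java.util.List;

        public class CdcPollingSketch {
            // Simplified "poll and iterate over tables" loop: read the current maximum
            // LSN, then query each capture instance for the changes between the last
            // processed LSN and that maximum.
            public void poll(Connection connection, byte[] lastProcessedLsn, List<String> captureInstances)
                    throws SQLException {
                byte[] maxLsn;
                try (Statement statement = connection.createStatement();
                     ResultSet rs = statement.executeQuery("SELECT sys.fn_cdc_get_max_lsn()")) {
                    rs.next();
                    maxLsn = rs.getBytes(1);
                }
                for (String captureInstance : captureInstances) { // e.g. "dbo_accounts" (hypothetical)
                    String query = "SELECT * FROM cdc.fn_cdc_get_all_changes_" + captureInstance + "(?, ?, N'all')";
                    try (PreparedStatement statement = connection.prepareStatement(query)) {
                        statement.setBytes(1, lastProcessedLsn); // where streaming stopped last time
                        statement.setBytes(2, maxLsn);
                        try (ResultSet changes = statement.executeQuery()) {
                            while (changes.next()) {
                                // turn each row into a change event ...
                            }
                        }
                    }
                }
            }
        }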

      Proposal

      The above "poll and iterate over tables" logic could be extending by adding the "databases" dimension on top of the existing loop.

      1. Instead of using a fixed database name from the configuration, the connector lists all CDC-enabled databases on the instance. The list can be filtered through include and exclude lists, similar to the MySQL connector (see the sketch after this list).
      2. Instead of a single task, the connector can have multiple tasks, each of which will capture the changes from a subset of the databases.
      3. In addition to (or instead of?) the SERVER_PARTITION_KEY, the keys of the committed offset messages will contain the database name, so Kafka Connect will commit one offset per database, not one offset per connector.
      4. During iteration over the databases, the connector will either change the active database on the JDBC connection using a USE <dbname> SQL statement or use a fully qualified name for all stored procedures and table names to avoid a dependency on the global connection state.
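
      A rough sketch of items 1–3 (the task.databases property, the partition key names, and the helper class are made-up illustrations, not part of the current connector):

        import java.sql.Connection;
        import java.sql.ResultSet;
        import java.sql.SQLException;
        import java.sql.Statement;
        import java.util.ArrayList;
        import java.util.HashMap;
        import java.util.List;
        import java.util.Map;
        import java.util.stream.Collectors;

        import org.apache.kafka.connect.util.ConnectorUtils;

        public class MultiDatabasePlanner {
            // Item 1: list all CDC-enabled databases on the instance; the include and
            // exclude lists would be applied to this result.
            public List<String> listCdcEnabledDatabases(Connection connection) throws SQLException {
                List<String> databases = new ArrayList<>();
                try (Statement statement = connection.createStatement();
                     ResultSet rs = statement.executeQuery(
                             "SELECT name FROM sys.databases WHERE is_cdc_enabled = 1")) {
                    while (rs.next()) {
                        databases.add(rs.getString(1));
                    }
                }
                return databases;
            }

            // Item 2: split the databases across the requested number of tasks;
            // "task.databases" is a hypothetical task-level property.
            public List<Map<String, String>> taskConfigs(List<String> databases, int maxTasks,
                                                         Map<String, String> connectorConfig) {
                int groups = Math.max(1, Math.min(maxTasks, databases.size()));
                return ConnectorUtils.groupPartitions(databases, groups).stream()
                        .map(group -> {
                            Map<String, String> taskConfig = new HashMap<>(connectorConfig);
                            taskConfig.put("task.databases", String.join(",", group));
                            return taskConfig;
                        })
                        .collect(Collectors.toList());
            }

            // Item 3: one source partition per database, so Kafka Connect commits one
            // offset per database rather than one per connector.
            public Map<String, String> sourcePartition(String serverName, String database) {
                return Map.of("server", serverName, "database", database);
            }
        }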

      Technical details and challenges (identified so far)

      1. Both the data and the metadata connections will have to execute statements against multiple databases.
      2. Probably, the snapshots of all databases should be done prior to streaming the changes. Given that the snapshot and the CDC phases are coordinated by the connector-agnostic ChangeEventSourceCoordinator, it's not clear how to implement this.
      3. The ChangeEventSourceCoordinator class, the OffsetContext, and its loader are designed for a single offset per connector/task, while in this case we need multiple offsets per task (one for each database). This seems to be supported only by Kafka Connect's more generic OffsetStorageReader (see the sketch below).
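
      For item 3, a minimal sketch of how a task could restore one offset per assigned database via the generic OffsetStorageReader (the partition key names are the same assumptions as in the sketch above):

        import java.util.Collection;
        import java.util.List;
        import java.util.Map;
        import java.util.stream.Collectors;

        import org.apache.kafka.connect.source.SourceTaskContext;

        public class MultiDatabaseOffsetLoader {
            // Restores one offset per database assigned to this task. Unlike the
            // single-offset flow of OffsetContext.Loader, OffsetStorageReader#offsets()
            // can read several partitions in one call.
            public Map<Map<String, String>, Map<String, Object>> loadOffsets(
                    SourceTaskContext context, String serverName, List<String> databases) {
                Collection<Map<String, String>> partitions = databases.stream()
                        .map(database -> Map.of("server", serverName, "database", database))
                        .collect(Collectors.toList());
                return context.offsetStorageReader().offsets(partitions);
            }
        }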

            Comments

            Gunnar Morling added a comment - sergeimorozov, merged the PRs related to the offset context change. Thx a lot.

            Sergei Morozov (Inactive) added a comment - gunnar.morling, thanks for setting up the repo! Please see debezium/debezium-design-documents#1.

            Gunnar Morling added a comment - Hey sergeimorozov, ok, excellent, let's do it that way. I've added the repo here: https://github.com/debezium/debezium-design-documents/blob/main/README.md. Let's see how it goes. Thanks a lot.

            Sergei Morozov (Inactive) added a comment -

                [...] a Gist makes it tough (impossible?) to comment on it.

            Uh... I didn't take that into account.

                I could create that repo, add you as a collaborator and we give it a try?

            I think it could be just a repo with a README to which I'd create a PR with a new document (DIP-1?). We can work on it as a regular file change w/o me being a collaborator.

            Gunnar Morling added a comment - sergeimorozov, I think GitHub is fine, but a Gist makes it tough (impossible?) to comment on it. Perhaps we should have a repo for design documents under the debezium org, and there we could work with PRs / regular commits? Just thinking out loud here, it's the first time we're doing this. I could create that repo, add you as a collaborator and we give it a try?

            Sergei Morozov (Inactive) added a comment (edited) - gunnar.morling, here's the document: https://gist.github.com/morozov/1157990ccd7ca00ca89637da914a4c08. I hope it's okay if I maintain it on GitHub since it's easier for me to use its Markdown dialect than the one from Jira. Please let me know if you'd prefer it to be maintained here.

            Gunnar Morling added a comment - Hey sergeimorozov, the other day you shared a design document for this issue, but I can't find it any longer. Could you add a link to it here? Thanks!

            Gunnar Morling added a comment - Excellent; sent you an invite to the address linked to your Jira account.

            Sergei Morozov (Inactive) added a comment -

                the issue is which offset to acknowledge with the database?

            I see that some kind of acknowledgment with the database is implemented in the PostgreSQL connector (PostgresStreamingChangeEventSource#commitOffset()). Is this what you're talking about? I cannot find anything like this in the SQL Server connector.

                Any preference between Monday or Tuesday?

            Tuesday is better. Apparently, Monday is a US holiday.

            Gunnar Morling added a comment -

                the offset of each database will be confirmed and committed independently...

            Yes, I think things are fine from a Kafka Connect perspective, but the issue is which offset to acknowledge with the database? This must be coordinated between all the tasks, to make sure no task prematurely acknowledges a position that still needs processing by other tasks.

                I believe this should happen at the connector level.

            Yes, you're right, that's what I meant to say. The connector would prepare the right set of configuration(s) and pass a dedicated one to each task that gets created.

                I'd love that. Something around 9 AM PST (6 PM CET) would be great.

            Ok, I'll set something up for early next week at that time. Any preference between Monday or Tuesday?

              Assignee: Unassigned
              Reporter: Sergei Morozov (Inactive)
              Votes: 1
              Watchers: 5
