Debezium / DBZ-2053

Support snapshotting new tables added to existing postgres publication/connector



    Description

Below I propose three potential solutions to this issue. We are likely to use Debezium in a way that will require adding tables to it at least dozens of times per year, so this particular feature is an important one for us.

1. With initial snapshot mode, we could snapshot new tables using a workflow similar to how Postgres handles snapshotting new tables added to an existing logical replication publication/subscription.

For example, you can add a table to replication with a command on the publisher and a refresh on the subscriber:
      Publisher:
      ```
      ALTER PUBLICATION foo ADD TABLE bar;
      ```
      Subscriber:
      ```
      ALTER SUBSCRIPTION foo_sub REFRESH PUBLICATION WITH (copy_data = true);
      ```

      What happens then is explained well in this brief summary:
      https://www.postgresql.org/docs/12/logical-replication-architecture.html#LOGICAL-REPLICATION-SNAPSHOT

      "The initial data in existing subscribed tables are snapshotted and copied in a parallel instance of a special kind of apply process. This process will create its own temporary replication slot and copy the existing data. Once existing data is copied, the worker enters synchronization mode, which ensures that the table is brought up to a synchronized state with the main apply process by streaming any changes that happened during the initial data copy using standard logical replication. Once the synchronization is done, the control of the replication of the table is given back to the main apply process where the replication continues as normal."

      So in terms of what could be done in Debezium:

• We create another, temporary replication slot to queue changes during the copy. At this point we also start ignoring published changes for this table from the main replication slot.
      • Then we copy the new table's data into Kafka.
      • Once that finishes, we read the copy slot to its end.
      • At this point we will have two LSN events for each write to the table, one in each replication slot (main and copy). I believe we would need to record the LSN of the last event read from the copy slot, resume reading from the main slot only after that LSN, and then drop the copy slot. Of course, we would need to verify that we actually saw that same last event in the main slot, so that we can be sure we did not skip data.
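      The slot hand-off in the last step can be sketched in simplified code. This is only an illustration of the dedup logic, assuming LSNs can be compared as plain longs and that both slots saw the same committed changes for the table; none of these class or method names exist in Debezium:
      ```
      import java.util.ArrayList;
      import java.util.List;

      public class SlotHandoff {
          // Resume emitting from the main slot, skipping everything that was
          // already emitted while draining the temporary copy slot.
          static List<Long> resumeFromMainSlot(List<Long> mainSlotLsns, long lastCopySlotLsn) {
              List<Long> emitted = new ArrayList<>();
              for (long lsn : mainSlotLsns) {
                  if (lsn <= lastCopySlotLsn) {
                      continue; // duplicate: already emitted from the copy slot
                  }
                  emitted.add(lsn);
              }
              return emitted;
          }

          public static void main(String[] args) {
              // Main slot saw LSNs 10..50; the copy slot was drained through LSN 30,
              // so only 40 and 50 should still be emitted.
              List<Long> main = List.of(10L, 20L, 30L, 40L, 50L);
              System.out.println(resumeFromMainSlot(main, 30L)); // [40, 50]
          }
      }
      ```
      The "verify we saw the same last event" check would amount to asserting that `lastCopySlotLsn` actually appears in the main slot's stream before dropping the copy slot.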

This is a complex operation, and it is one solution to the problem posed by DBZ-1961, which is really a workaround for not having this snapshot-new-tables functionality built in.

2. A second, different solution to this problem would be to support something like `snapshot.table.whitelist`. The idea is that, in the same connector as the main one, you could specify a separate list of tables to snapshot rather than inheriting it from the other configs. This would allow me, for example, to add a table to the Debezium config and then do a one-time snapshot by restarting the connector with `snapshot.mode = always` and `snapshot.table.whitelist = my_new.table`.
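      To make this concrete, a hypothetical connector configuration for such a one-time snapshot might look like the following (`snapshot.table.whitelist` is the property being proposed here and does not exist today; the table names are illustrative):
      ```
      # Existing streaming config, with the new table added:
      table.whitelist=public.existing_table,my_new.table
      # Proposed: restart once with these to snapshot only the new table
      snapshot.mode=always
      snapshot.table.whitelist=my_new.table
      ```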

3. A third solution is what Jiri suggested in the same ticket: support another snapshot mode that lets me create a "snapshot only" connector, where I do not want to stream changes at all after the snapshot.
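      Sketched as configuration, again hypothetically (no such mode exists; the name `snapshot_only` is purely illustrative):
      ```
      # Separate, short-lived connector used only for the snapshot:
      snapshot.mode=snapshot_only   # proposed: snapshot, then stop; never stream
      table.whitelist=my_new.table
      ```
      Such a connector would be started alongside the main one, left to finish its snapshot, and then deleted.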

      I really appreciate any input here. Thank you!

People

              Assignee: Unassigned
              Reporter: Jeremy Finzel (jfinzel)