Debezium / DBZ-7311

PostgreSQL ad-hoc blocking snapshots fail when snapshot mode is "never"



      Bug report

      For bug reports, please provide the following information:

      What Debezium connector do you use and what version?

      Debezium PostgreSQL 2.5.0.Final

      What is the connector configuration?

       

      # General connector setup
      name=SnapshotModeNeverCrash
      connector.class=io.debezium.connector.postgresql.PostgresConnector
      tasks.max=1
      
      # Basic database connection information
      database.hostname=10.10.100.1
      database.port=5432
      database.user=cdc_replication
      database.dbname=featureflags
      
      # Logical replication configuration
      publication.name=cdc_publication
      publication.autocreate.mode=disabled
      slot.name=snapshot_mode_never_crash
      plugin.name=pgoutput
      
      # Disable snapshots on initialization
      snapshot.mode=never
      table.include.list=public.ffl_feature_flags
      topic.prefix=snapshot-mode-never-crash
      
      # Signalling and notifications
      signal.enabled.channels=source
      signal.data.collection=framework.dbz_main_cdc_signal_v001
      
      key.converter=io.confluent.connect.avro.AvroConverter
      key.converter.schema.registry.url=http://10.10.100.1:17731
      value.converter=io.confluent.connect.avro.AvroConverter
      value.converter.schema.registry.url=http://10.10.100.1:17731
      

       

      What is the captured database version and mode of deployment?

      PostgreSQL 13 running in a local Docker container for local-only development.

      Steps to reproduce

      1. Create test database in PostgreSQL with a signal collection table and a sample data table to snapshot.
      2. Create a connector with a configuration similar to the one shared above. The key part is setting snapshot.mode to never.
      3. Run SQL similar to the following to run an ad-hoc blocking snapshot.
      INSERT INTO framework.dbz_main_cdc_signal_v001 (signal_id, type, data)
      VALUES (gen_random_uuid()::text, 'execute-snapshot', '{"data-collections": ["public.ffl_feature_flags"], "type": "blocking"}')
      RETURNING signal_id;
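For completeness, a hypothetical DDL for the signal table referenced above (the report does not include it). The column names follow the INSERT in step 3, and the types follow the general Debezium signaling-table convention of text-like id/type/data columns:

```sql
-- Hypothetical sketch: signal table matching the INSERT statement above.
-- Actual column sizes/types in the reporter's environment are assumptions.
CREATE TABLE framework.dbz_main_cdc_signal_v001 (
    signal_id VARCHAR(42) PRIMARY KEY,  -- unique signal identifier
    type      VARCHAR(32) NOT NULL,     -- e.g. 'execute-snapshot'
    data      VARCHAR(2048)             -- JSON payload with snapshot parameters
);
```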

      What behaviour do you expect?

      An ad-hoc blocking snapshot should be taken of the requested table as normal.

      What behaviour do you see?

      The connector dies with the following error.

       

      [2024-01-04 06:38:16,368] INFO [SnapshotModeNeverCrash|task-0] Requested 'BLOCKING' snapshot of data collections '[public.ffl_feature_flags]' with additional conditions '[]' and surrogate key 'PK of table will be used' (io.debezium.pipeline.signal.actions.snapshotting.ExecuteSnapshot:64)
      [2024-01-04 06:38:16,371] INFO [SnapshotModeNeverCrash|task-0] Creating thread debezium-postgresconnector-snapshot-mode-never-crash-blocking-snapshot (io.debezium.util.Threads:288)
      [2024-01-04 06:38:16,376] INFO [SnapshotModeNeverCrash|task-0] Streaming will now pause (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource:246)
      [2024-01-04 06:38:16,376] INFO [SnapshotModeNeverCrash|task-0] Starting snapshot (io.debezium.pipeline.ChangeEventSourceCoordinator:215)
      [2024-01-04 06:38:16,378] INFO [SnapshotModeNeverCrash|task-0] Snapshot step 1 - Preparing (io.debezium.relational.RelationalSnapshotChangeEventSource:121)
      [2024-01-04 06:38:16,397] INFO [SnapshotModeNeverCrash|task-0] Snapshot step 2 - Determining captured tables (io.debezium.relational.RelationalSnapshotChangeEventSource:130)
      [2024-01-04 06:38:16,402] INFO [SnapshotModeNeverCrash|task-0] Created connection pool with 1 threads (io.debezium.relational.RelationalSnapshotChangeEventSource:222)
      [2024-01-04 06:38:16,402] INFO [SnapshotModeNeverCrash|task-0] Snapshot step 3 - Locking captured tables [public.ffl_feature_flags] (io.debezium.relational.RelationalSnapshotChangeEventSource:139)
      [2024-01-04 06:38:16,403] INFO [SnapshotModeNeverCrash|task-0] Waiting a maximum of '10' seconds for each table lock (io.debezium.connector.postgresql.PostgresSnapshotChangeEventSource:118)
      [2024-01-04 06:38:16,432] INFO [SnapshotModeNeverCrash|task-0] Snapshot step 4 - Determining snapshot offset (io.debezium.relational.RelationalSnapshotChangeEventSource:145)
      [2024-01-04 06:38:16,432] INFO [SnapshotModeNeverCrash|task-0] Creating initial offset context (io.debezium.connector.postgresql.PostgresOffsetContext:233)
      [2024-01-04 06:38:16,434] INFO [SnapshotModeNeverCrash|task-0] Read xlogStart at 'LSN{1/554F4BD0}' from transaction '133249' (io.debezium.connector.postgresql.PostgresOffsetContext:236)
      [2024-01-04 06:38:16,436] INFO [SnapshotModeNeverCrash|task-0] Read xlogStart at 'LSN{1/554F4BD0}' from transaction '133249' (io.debezium.connector.postgresql.PostgresSnapshotChangeEventSource:155)
      [2024-01-04 06:38:16,436] INFO [SnapshotModeNeverCrash|task-0] Snapshot step 5 - Reading structure of captured tables (io.debezium.relational.RelationalSnapshotChangeEventSource:148)
      [2024-01-04 06:38:16,436] INFO [SnapshotModeNeverCrash|task-0] Reading structure of schema 'public' of catalog 'featureflags' (io.debezium.connector.postgresql.PostgresSnapshotChangeEventSource:201)
      [2024-01-04 06:38:16,454] INFO [SnapshotModeNeverCrash|task-0] Snapshot step 6 - Persisting schema history (io.debezium.relational.RelationalSnapshotChangeEventSource:152)
      [2024-01-04 06:38:16,454] INFO [SnapshotModeNeverCrash|task-0] Snapshot step 7 - Snapshotting data (io.debezium.relational.RelationalSnapshotChangeEventSource:164)
      [2024-01-04 06:38:16,456] INFO [SnapshotModeNeverCrash|task-0] Creating snapshot worker pool with 1 worker thread(s) (io.debezium.relational.RelationalSnapshotChangeEventSource:413)
      [2024-01-04 06:38:16,470] ERROR [SnapshotModeNeverCrash|task-0] Error during snapshot (io.debezium.relational.RelationalSnapshotChangeEventSource:179)
      java.lang.UnsupportedOperationException: 'never' snapshot mode cannot build queries
          at io.debezium.connector.postgresql.snapshot.NeverSnapshotter.buildSnapshotQuery(NeverSnapshotter.java:50)
          at io.debezium.connector.postgresql.PostgresSnapshotChangeEventSource.getSnapshotSelect(PostgresSnapshotChangeEventSource.java:243)
          at io.debezium.relational.RelationalSnapshotChangeEventSource.determineSnapshotSelect(RelationalSnapshotChangeEventSource.java:651)
          at io.debezium.relational.RelationalSnapshotChangeEventSource.createDataEvents(RelationalSnapshotChangeEventSource.java:420)
          at io.debezium.relational.RelationalSnapshotChangeEventSource.doExecute(RelationalSnapshotChangeEventSource.java:165)
          at io.debezium.pipeline.source.AbstractSnapshotChangeEventSource.execute(AbstractSnapshotChangeEventSource.java:92)
          at io.debezium.pipeline.ChangeEventSourceCoordinator.doSnapshot(ChangeEventSourceCoordinator.java:253)
          at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$doBlockingSnapshot$7(ChangeEventSourceCoordinator.java:218)
          at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:829)
      [2024-01-04 06:38:16,487] INFO [SnapshotModeNeverCrash|task-0] Snapshot - Final stage (io.debezium.pipeline.source.AbstractSnapshotChangeEventSource:104)
      [2024-01-04 06:38:16,488] WARN [SnapshotModeNeverCrash|task-0] Snapshot was not completed successfully, it will be re-executed upon connector restart (io.debezium.pipeline.source.AbstractSnapshotChangeEventSource:115)
      [2024-01-04 06:38:16,724] INFO [SnapshotModeNeverCrash|task-0] 1 records sent during previous 00:07:21.005, last recorded offset of {server=snapshot-mode-never-crash} partition is {transaction_id=null, lsn_proc=5726226976, messageType=INSERT, lsn_commit=5726125176, lsn=5726226976, txId=133248, ts_usec=1704350295932004} (io.debezium.connector.common.BaseSourceTask:215)
      [2024-01-04 06:38:16,990] WARN [SnapshotModeNeverCrash|task-0] [Producer clientId=connector-producer-SnapshotModeNeverCrash-0, transactionalId=kafka-connect-yoda-SnapshotModeNeverCrash-0] Error while fetching metadata with correlation id 6 : {snapshot-mode-never-crash.framework.dbz_main_cdc_signal_v001=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient:1145)
      [2024-01-04 06:38:17,091] WARN [SnapshotModeNeverCrash|task-0] [Producer clientId=connector-producer-SnapshotModeNeverCrash-0, transactionalId=kafka-connect-yoda-SnapshotModeNeverCrash-0] Error while fetching metadata with correlation id 7 : {snapshot-mode-never-crash.framework.dbz_main_cdc_signal_v001=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient:1145)
      [2024-01-04 06:38:17,193] WARN [SnapshotModeNeverCrash|task-0] [Producer clientId=connector-producer-SnapshotModeNeverCrash-0, transactionalId=kafka-connect-yoda-SnapshotModeNeverCrash-0] Error while fetching metadata with correlation id 8 : {snapshot-mode-never-crash.framework.dbz_main_cdc_signal_v001=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient:1145)
      [2024-01-04 06:38:17,295] WARN [SnapshotModeNeverCrash|task-0] [Producer clientId=connector-producer-SnapshotModeNeverCrash-0, transactionalId=kafka-connect-yoda-SnapshotModeNeverCrash-0] Error while fetching metadata with correlation id 9 : {snapshot-mode-never-crash.framework.dbz_main_cdc_signal_v001=UNKNOWN_TOPIC_OR_PARTITION} (org.apache.kafka.clients.NetworkClient:1145)
      <snip many more>
      [2024-01-04 06:39:56,075] INFO [SnapshotModeNeverCrash|worker] [Consumer clientId=connector-consumer-SnapshotModeNeverCrash, groupId=connect-SnapshotModeNeverCrash] Node -1 disconnected. (org.apache.kafka.clients.NetworkClient:977)
      [2024-01-04 06:39:56,576] INFO [SnapshotModeNeverCrash|task-0] [Consumer clientId=connector-consumer-SnapshotModeNeverCrash-0, groupId=connect-SnapshotModeNeverCrash] Node -1 disconnected. (org.apache.kafka.clients.NetworkClient:977)
      [2024-01-04 06:40:22,176] INFO [SnapshotModeNeverCrash|task-0] [Producer clientId=connector-producer-SnapshotModeNeverCrash-0, transactionalId=kafka-connect-yoda-SnapshotModeNeverCrash-0] Node -1 disconnected. (org.apache.kafka.clients.NetworkClient:977)
      [2024-01-04 06:40:22,180] INFO [SnapshotModeNeverCrash|task-0] [Producer clientId=connector-producer-SnapshotModeNeverCrash-0, transactionalId=kafka-connect-yoda-SnapshotModeNeverCrash-0] Resetting the last seen epoch of partition snapshot-mode-never-crash.framework.dbz_main_cdc_signal_v001-0 to 0 since the associated topicId changed from null to BBVFsySiRKS9ZK_yDgxEGw (org.apache.kafka.clients.Metadata:408)
      [2024-01-04 06:40:55,898] INFO [SnapshotModeNeverCrash|worker] [AdminClient clientId=connector-adminclient-SnapshotModeNeverCrash] Node 1 disconnected. (org.apache.kafka.clients.NetworkClient:977)
      [2024-01-04 06:40:55,906] INFO [SnapshotModeNeverCrash|task-0] [AdminClient clientId=connector-adminclient-SnapshotModeNeverCrash-0] Node 1 disconnected. (org.apache.kafka.clients.NetworkClient:977)

       

      After these events have been logged, the connector and task remain in a "healthy" running state. However, streaming actually remains stopped, and no further updates from the WAL are streamed into Kafka. (Try inserting a row into the table after the failure, and observe that it no longer makes it past Debezium.)
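The stalled-streaming symptom can be probed as described; a hypothetical example (the table's columns are assumptions, since its schema is not shown in the report):

```sql
-- Hypothetical probe: insert a row after the failed snapshot, then watch the
-- corresponding Kafka topic. With this bug, no change event is produced even
-- though the connector reports a RUNNING state.
INSERT INTO public.ffl_feature_flags (flag_name, enabled)
VALUES ('probe_after_failed_snapshot', true);
```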

      If I then attempt to restart the connector, the following error is logged while the connector shuts down, indicating that it cannot stop gracefully:

       

      [2024-01-04 06:47:29,465] INFO [SnapshotModeNeverCrash|worker] Stopping connector SnapshotModeNeverCrash (org.apache.kafka.connect.runtime.Worker:408)
      [2024-01-04 06:47:29,466] INFO [SnapshotModeNeverCrash|worker] Scheduled shutdown for WorkerConnector{id=SnapshotModeNeverCrash} (org.apache.kafka.connect.runtime.WorkerConnector:267)
      [2024-01-04 06:47:29,473] INFO [SnapshotModeNeverCrash|worker] Completed shutdown for WorkerConnector{id=SnapshotModeNeverCrash} (org.apache.kafka.connect.runtime.WorkerConnector:287)
      [2024-01-04 06:47:29,475] INFO [SnapshotModeNeverCrash|worker] Stopping KafkaOffsetBackingStore (org.apache.kafka.connect.storage.KafkaOffsetBackingStore:293)
      [2024-01-04 06:47:29,475] INFO [SnapshotModeNeverCrash|worker] Stopping KafkaBasedLog for topic connect-offsets-yoda (org.apache.kafka.connect.util.KafkaBasedLog:292)
      [2024-01-04 06:47:29,478] INFO [SnapshotModeNeverCrash|worker] [Consumer clientId=connector-consumer-SnapshotModeNeverCrash, groupId=connect-SnapshotModeNeverCrash] Resetting generation and member id due to: consumer pro-actively leaving the group (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1025)
      [2024-01-04 06:47:29,478] INFO [SnapshotModeNeverCrash|worker] [Consumer clientId=connector-consumer-SnapshotModeNeverCrash, groupId=connect-SnapshotModeNeverCrash] Request joining group due to: consumer pro-actively leaving the group (org.apache.kafka.clients.consumer.internals.ConsumerCoordinator:1072)
      [2024-01-04 06:47:29,569] INFO [SnapshotModeNeverCrash|worker] Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics:693)
      [2024-01-04 06:47:29,569] INFO [SnapshotModeNeverCrash|worker] Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics:697)
      [2024-01-04 06:47:29,569] INFO [SnapshotModeNeverCrash|worker] Metrics reporters closed (org.apache.kafka.common.metrics.Metrics:703)
      [2024-01-04 06:47:29,571] INFO [SnapshotModeNeverCrash|worker] App info kafka.consumer for connector-consumer-SnapshotModeNeverCrash unregistered (org.apache.kafka.common.utils.AppInfoParser:83)
      [2024-01-04 06:47:29,571] INFO [SnapshotModeNeverCrash|worker] Stopped KafkaBasedLog for topic connect-offsets-yoda (org.apache.kafka.connect.util.KafkaBasedLog:316)
      [2024-01-04 06:47:29,571] INFO [SnapshotModeNeverCrash|worker] Stopped KafkaOffsetBackingStore (org.apache.kafka.connect.storage.KafkaOffsetBackingStore:301)
      [2024-01-04 06:47:29,572] INFO [SnapshotModeNeverCrash|worker] App info kafka.admin.client for connector-adminclient-SnapshotModeNeverCrash unregistered (org.apache.kafka.common.utils.AppInfoParser:83)
      [2024-01-04 06:47:29,573] INFO [SnapshotModeNeverCrash|worker] Metrics scheduler closed (org.apache.kafka.common.metrics.Metrics:693)
      [2024-01-04 06:47:29,573] INFO [SnapshotModeNeverCrash|worker] Closing reporter org.apache.kafka.common.metrics.JmxReporter (org.apache.kafka.common.metrics.Metrics:697)
      [2024-01-04 06:47:29,573] INFO [SnapshotModeNeverCrash|worker] Metrics reporters closed (org.apache.kafka.common.metrics.Metrics:703)
      [2024-01-04 06:47:29,573] INFO [SnapshotModeNeverCrash|task-0] Stopping task SnapshotModeNeverCrash-0 (org.apache.kafka.connect.runtime.Worker:1002)
      [2024-01-04 06:47:29,780] INFO [SnapshotModeNeverCrash|task-0] Stopping down connector (io.debezium.connector.common.BaseSourceTask:284)
      [2024-01-04 06:47:34,575] ERROR [SnapshotModeNeverCrash|task-0] Graceful stop of task SnapshotModeNeverCrash-0 failed. (org.apache.kafka.connect.runtime.Worker:1029) 

      The connector then starts again. During startup, it fails to register the metrics MBean because one with the same name already exists (probably a consequence of the non-graceful stop).

      The restart itself is also quite messy:

      [2024-01-04 06:47:35,949] WARN [SnapshotModeNeverCrash|task-0] Unable to register metrics as an old set with the same name exists, retrying in PT5S (attempt 1 out of 12) (io.debezium.pipeline.JmxUtils:55)
      [2024-01-04 06:47:40,951] WARN [SnapshotModeNeverCrash|task-0] Unable to register metrics as an old set with the same name exists, retrying in PT5S (attempt 2 out of 12) (io.debezium.pipeline.JmxUtils:55)
      [2024-01-04 06:47:45,952] WARN [SnapshotModeNeverCrash|task-0] Unable to register metrics as an old set with the same name exists, retrying in PT5S (attempt 3 out of 12) (io.debezium.pipeline.JmxUtils:55)
      <snip many more>
      [2024-01-04 06:48:20,960] WARN [SnapshotModeNeverCrash|task-0] Unable to register metrics as an old set with the same name exists, retrying in PT5S (attempt 10 out of 12) (io.debezium.pipeline.JmxUtils:55)
      [2024-01-04 06:48:25,961] WARN [SnapshotModeNeverCrash|task-0] Unable to register metrics as an old set with the same name exists, retrying in PT5S (attempt 11 out of 12) (io.debezium.pipeline.JmxUtils:55)
      [2024-01-04 06:48:30,963] ERROR [SnapshotModeNeverCrash|task-0] Failed to register metrics MBean, metrics will not be available (io.debezium.pipeline.JmxUtils:62) 

      After all of this lengthy, abnormal behavior, the WAL records from before the restart do eventually make it through.

      Note that the original error said: "Snapshot was not completed successfully, it will be re-executed upon connector restart". However, no such re-execution happens.

      Probably the only real way to recover is to reboot the entire Kafka Connect node, which I didn't try.

       

      Do you see the same behaviour using the latest released Debezium version?

      Yes, with Debezium 2.5.0.Final

      Do you have the connector logs, ideally from start till finish?

      (You might be asked later to provide DEBUG/TRACE level log)

      See above.

      How to reproduce the issue using our tutorial deployment?

      See the steps to reproduce above. I think all that is really needed is to request an ad-hoc blocking snapshot while snapshot.mode is set to "never".

      Additional commentary

      I assume this issue arises because ad-hoc blocking snapshots (logically) reuse most of the code path of initial snapshots, and this validation check was not updated to account for them when the blocking snapshot feature was added.

      I do think that running ad-hoc blocking snapshots should be a valid action, even when snapshot.mode is set to never. That way we can decide for ourselves when to take snapshots, for example during lower-traffic hours.
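To illustrate the suspected cause, here is a minimal Java sketch (not the actual Debezium source) mirroring what the stack trace shows: the "never" snapshotter rejects query building unconditionally, so the ad-hoc blocking snapshot path, which reuses the regular snapshot machinery, hits the same rejection. The interface and class names here are simplified stand-ins:

```java
import java.util.Optional;

// Simplified stand-in for Debezium's snapshotter abstraction.
interface Snapshotter {
    Optional<String> buildSnapshotQuery(String tableId);
}

// Sketch of the behavior seen in the stack trace: in "never" mode, any
// request to build a snapshot query is rejected outright, even when it
// originates from an ad-hoc blocking snapshot signal.
class NeverSnapshotterSketch implements Snapshotter {
    @Override
    public Optional<String> buildSnapshotQuery(String tableId) {
        // Matches the exception message from the log above.
        throw new UnsupportedOperationException("'never' snapshot mode cannot build queries");
    }
}
```

A fix would presumably allow query building (or select a different snapshotter) when the snapshot was explicitly requested via a signal, rather than keying solely off the configured snapshot.mode.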

      Assignee: Mario Fiore Vitale (rh-ee-mvitale)
      Reporter: James Johnston (james-johnston-thumbtack) (Inactive)