Debezium / DBZ-9337

Data loss occurs when connector restarts after failed ad-hoc blocking snapshot



      Bug report


      What Debezium connector do you use and what version?

      PostgreSQL connector, version 3.0.8.Final

      What is the connector configuration?

      connector.class = io.debezium.connector.postgresql.PostgresConnector
      max.queue.size = 3
      slot.name = slot1
      record.processing.shutdown.timeout.ms = 1000
      publication.name = publication1
      signal.enabled.channels = in-process
      record.processing.order = ORDERED
      topic.prefix = topic1
      offset.storage.file.filename = itc-3bcf4a43-c0be-4adf-8962-40adedf7449b.offsets
      record.processing.threads = 
      errors.retry.delay.initial.ms = 300
      value.converter = org.apache.kafka.connect.json.JsonConverter
      key.converter = org.apache.kafka.connect.json.JsonConverter
      publication.autocreate.mode = filtered
      database.user = test
      database.dbname = test
      offset.storage = org.apache.kafka.connect.storage.FileOffsetBackingStore
      offset.flush.timeout.ms = 5000
      errors.retry.delay.max.ms = 10000
      database.port = 32769
      plugin.name = pgoutput
      offset.flush.interval.ms = 1000
      internal.task.management.timeout.ms = 8000000
      record.processing.with.serial.consumer = false
      errors.max.retries = -1
      database.hostname = localhost
      database.password = ********
      name = issue-test-connector
      table.include.list = public.table1
      skipped.operations = none
      max.batch.size = 2
      snapshot.mode = initial

      What is the captured database version and mode of deployment?

      (E.g. on-premises, with a specific cloud provider, etc.)

      PostgreSQL 14

      What behavior do you expect?

      No data loss, even when the connector restarts after a failed ad-hoc blocking snapshot.

      What behavior do you see?

      The connector permanently loses the portion of data that was inserted while it was offline.

      Do you see the same behaviour using the latest released Debezium version?

      Yes, I have tested with 3.2.0.Final and 3.3.0.Alpha1.

      Do you have the connector logs, ideally from start till finish?

      Yes

      How to reproduce the issue using our tutorial deployment?

       
      Steps to reproduce:

      1. Setup: create two PostgreSQL tables (table1, table2), with table2 containing some "bad" records that will cause processing failures.
      2. Initial run: start the Debezium connector monitoring only table1, let it process the initial data, then stop it.
      3. Offline data insertion: while the connector is stopped, bulk-insert data into table1.
      4. Restart with an additional table: restart the connector monitoring both tables (table1, table2) and trigger an ad-hoc blocking snapshot on table2, which will fail because of the bad data.
      5. Final restart: restart the connector again, insert some more data (to start streaming), and observe that the portion of data inserted while the connector was offline is lost.
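
      The setup and trigger in the steps above can be sketched in SQL. This sketch is illustrative only: it uses the signal-table form of the ad-hoc snapshot signal for readability (the configuration above uses the in-process channel, which sends the equivalent signal programmatically), and the table names, column, and row counts are assumptions rather than the exact test data.

      ```sql
      -- Step 1: two tables; table2 gets a "bad" record that will make
      -- snapshot processing fail downstream (illustrative value).
      CREATE TABLE public.table1 (id serial PRIMARY KEY, val text);
      CREATE TABLE public.table2 (id serial PRIMARY KEY, val text);
      INSERT INTO public.table2 (val) VALUES ('bad-record');

      -- Step 3: inserted while the connector is stopped; these are the
      -- events that end up being lost.
      INSERT INTO public.table1 (val)
      SELECT 'offline-' || g FROM generate_series(1, 1000) g;

      -- Step 4: trigger an ad-hoc blocking snapshot of table2
      -- (signal-table form of the execute-snapshot signal).
      INSERT INTO public.debezium_signal (id, type, data)
      VALUES ('ad-hoc-1', 'execute-snapshot',
              '{"data-collections": ["public.table2"], "type": "blocking"}');
      ```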

      Here is a reproducible test case, runnable independently, that demonstrates the problem: Test Case

       
      The issue appears to be related to offset handling when snapshot processing fails: on restart, the connector resumes from these incorrect offsets and skips the streaming events that occurred during the downtime.
      The test above also includes a potential workaround (not an actual fix) that clears the pending snapshot offset so that streaming can resume from where it left off.
       
      On restart, the connector uses an offset position like the following as its starting point, resulting in data loss:

      {
        "last_snapshot_record": false,
        "lsn": 3491303656,
        "txId": 14100,
        "ts_usec": 1747335133620186,
        "snapshot": "BLOCKING",
        "snapshot_completed": false
      } 
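
      To make the failure mode concrete, here is a small self-contained simulation. It is not Debezium's actual resume logic, only an illustration of the reported behavior: trusting an offset from an unfinished blocking snapshot skips every event between the last streamed position and the snapshot's LSN, while the workaround of discarding the pending snapshot offset falls back to the last confirmed streaming position. The value of streamed_lsn is hypothetical.

      ```python
      # Illustration only -- not Debezium's actual code.

      # Offset persisted after the blocking snapshot failed (from the report):
      failed_offset = {
          "last_snapshot_record": False,
          "lsn": 3491303656,        # LSN captured around the snapshot
          "txId": 14100,
          "snapshot": "BLOCKING",
          "snapshot_completed": False,
      }

      # LSN at which streaming had actually stopped (hypothetical value):
      streamed_lsn = 3491200000


      def resume_lsn(offset, last_streamed_lsn):
          """Buggy behavior: blindly trust the stored offset even though it
          belongs to an unfinished blocking snapshot, skipping every event
          between last_streamed_lsn and offset["lsn"]."""
          return offset["lsn"]


      def resume_lsn_workaround(offset, last_streamed_lsn):
          """Workaround sketch: if the offset comes from an incomplete
          blocking snapshot, discard it and fall back to the last LSN that
          streaming actually confirmed."""
          if offset.get("snapshot") == "BLOCKING" and not offset.get("snapshot_completed"):
              return last_streamed_lsn
          return offset["lsn"]


      print(resume_lsn(failed_offset, streamed_lsn))             # events in between are lost
      print(resume_lsn_workaround(failed_offset, streamed_lsn))  # streaming resumes with no gap
      ```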

      Additional points

      • The test case file also contains additional comments and observations about other potential issues and questions that, while not directly related to this bug, would be valuable to address as well.

      Do let me know if you need any additional information or clarification on this issue.

       

              Assignee: Mario Fiore Vitale
              Reporter: Chirag Kava