DBZ-5896: Postgres Incremental Snapshots are not performant when an additional condition is used


    • Type: Enhancement
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version: 2.2-backlog
    • Component: postgresql-connector

      Feature request or enhancement

      Which use case/requirement will be addressed by the proposed feature?

      For Postgres, incremental snapshots are not performant for large tables with a large primary-key index. The generated chunk query gives the planner no bounds on the primary key, so Postgres may choose a suboptimal query plan. For example, consider the following scenario.

      Configure additionalCondition for the execute-snapshot signal as follows:

      updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147)  
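
      For reference, this condition is delivered through the execute-snapshot signal. A minimal sketch, assuming a signaling table myschema.debezium_signal registered via signal.data.collection (the table name and signal id are illustrative):

      -- myschema.debezium_signal and 'adhoc-backfill-1' are placeholder names
      INSERT INTO myschema.debezium_signal (id, type, data)
      VALUES ('adhoc-backfill-1', 'execute-snapshot',
              '{"data-collections": ["public.my_table"], "type": "incremental", "additional-condition": "updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147)"}');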

      For the initial chunk, the connector generates the following query:

      SELECT * FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY "id" LIMIT 1024

      The query planner may decide to scan the my_table_pkey index even though scanning index_my_table_updated_at would be optimal for this query. If the my_table_pkey index is very large (for example, 100 GB), the query may take a long time to finish or even time out.
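
      Whether this happens can be verified by running the generated chunk query under EXPLAIN; a diagnostic sketch:

      -- Reveals whether the planner picked my_table_pkey or index_my_table_updated_at
      EXPLAIN (ANALYZE, BUFFERS)
      SELECT * FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY "id" LIMIT 1024;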

      One possible solution is to calculate the min and max PK bounds under the additional condition and apply that PK range to every chunk, including the initial one.

      For example, the min and max PK bounds can be calculated as:

      min_id = SELECT id FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY id ASC LIMIT 1;
      max_id = SELECT id FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY id DESC LIMIT 1;
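
      Alternatively, both bounds could be fetched in a single round trip with aggregates, which Postgres can typically answer from index_my_table_updated_at. This is an equivalent sketch, not necessarily how the connector would implement it:

      -- One pass over the filtered rows for both bounds (sketch)
      SELECT min(id) AS min_id, max(id) AS max_id
      FROM "public"."my_table"
      WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147);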

      The initial chunk query then becomes:

      SELECT * FROM "public"."my_table" WHERE (id >= min_id AND id <= max_id) AND (updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147)) ORDER BY "id" LIMIT 1024

      and subsequent chunks can be queried as follows, assuming last_id is the id of the last row emitted in the previous chunk:

      SELECT * FROM "public"."my_table" WHERE (id > last_id AND id <= max_id) AND (updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147)) ORDER BY "id" LIMIT 1024

      This should perform well whether the query planner chooses the my_table_pkey index or index_my_table_updated_at, and it helps use cases where we want to backfill recently updated rows.

      Implementation ideas (optional)

      Calculate the min and max key once, at the start of the incremental snapshot, and use that min/max range for all the chunks.
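
      For context, the snapshot already issues a max-key probe before the first chunk; the proposal essentially appends the additional condition to that probe and adds a matching min-key probe. The query shapes might look like this (sketch only):

      -- Today: max-key probe without the additional condition
      SELECT * FROM "public"."my_table" ORDER BY "id" DESC LIMIT 1;

      -- Proposed: probes that respect the additional condition
      SELECT * FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY "id" ASC LIMIT 1;   -- min key
      SELECT * FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY "id" DESC LIMIT 1;  -- max key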

              Assignee: Unassigned
              Reporter: Praveen Burgu (prburgu)
              Votes: 0
              Watchers: 2