Type: Enhancement
Resolution: Unresolved
Priority: Major
Feature request or enhancement
Which use case/requirement will be addressed by the proposed feature?
For Postgres, incremental snapshots are not performant for large tables with large indexes. The chunk query does not give Postgres enough information to choose an optimal query plan. For example, consider the following scenario.
Configure additionalCondition for execute-snapshot as follows:
updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147)
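For reference, this condition is delivered through the execute-snapshot signal. Assuming a signaling table named debezium_signal (the table name and signal id below are illustrative; the JSON field names follow the documented incremental snapshot signal format), the signal could be sent as:
INSERT INTO debezium_signal (id, type, data)
VALUES ('adhoc-backfill-1', 'execute-snapshot',
        '{"data-collections": ["public.my_table"], "type": "incremental", "additional-condition": "updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147)"}');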
For the initial chunk, Debezium generates the following query.
SELECT * FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY "id" LIMIT 1024
The query planner may decide to scan the `my_table_pkey` index even though scanning `index_my_table_updated_at` would be optimal here. If the `my_table_pkey` index is very large (for example 100 GB), this chunk query may take a long time to finish or even time out.
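To check which index the planner actually picks for the chunk query (a suggested diagnostic, not part of the original report), the same statement can be prefixed with EXPLAIN:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY "id" LIMIT 1024;
An index scan on my_table_pkey with updated_at appearing only as a filter indicates the planner is walking the primary key in id order and discarding most rows.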
One possible solution is to calculate the min and max of the PK range and use that range even for the initial chunk.
For example, the min and max of the PK range can be calculated as:
min_id = SELECT id FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY id ASC LIMIT 1;
max_id = SELECT id FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY id DESC LIMIT 1;
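Equivalently (a sketch of the same bound calculation, not part of the original proposal), both bounds can be fetched in a single statement:
SELECT min(id) AS min_id, max(id) AS max_id
FROM "public"."my_table"
WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147);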
Initial chunk query:
SELECT * FROM "public"."my_table" WHERE (id >= min_id AND id <= max_id) AND (updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147)) ORDER BY "id" LIMIT 1024
Subsequent chunks can then be queried as follows (assuming last_id is the id of the last row that was written):
SELECT * FROM "public"."my_table" WHERE (id > last_id AND id <= max_id) AND (updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147)) ORDER BY "id" LIMIT 1024
This should perform well whether Postgres chooses the `my_table_pkey` index or `index_my_table_updated_at`, and it will help in use cases where we want to backfill recently updated rows.
Implementation ideas (optional)
Calculate the min and max key for the incremental snapshot here and use that min/max range for all the chunks, as sketched below.
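A minimal sketch of the resulting query lifecycle, assuming the bounds are computed once per snapshot and :min_id, :max_id, :last_id are bind parameters managed by the connector (parameter names are illustrative):
-- Once per snapshot: compute the key bounds under the additional condition.
SELECT min(id) AS min_id, max(id) AS max_id
FROM "public"."my_table"
WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147);
-- First chunk: start from the lower bound.
SELECT * FROM "public"."my_table"
WHERE (id >= :min_id AND id <= :max_id)
  AND (updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147))
ORDER BY "id" LIMIT 1024;
-- Each subsequent chunk: resume after the last key written, until a chunk
-- returns no rows or last_id reaches max_id.
SELECT * FROM "public"."my_table"
WHERE (id > :last_id AND id <= :max_id)
  AND (updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147))
ORDER BY "id" LIMIT 1024;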