Debezium / DBZ-5896

Postgres Incremental Snapshots are not performant when additional condition is used


    • Type: Enhancement
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: 2.2-backlog
    • Component/s: postgresql-connector

      Feature request or enhancement

      Which use case/requirement will be addressed by the proposed feature?

      For Postgres, incremental snapshots are not performant for large tables with large indexes. The generated chunk query lacks the predicates Postgres needs to choose an optimal query plan. For example, consider the following scenario.

      Configure additionalCondition for the execute-snapshot signal as follows:

       

      updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147)  

      For the initial chunk, Debezium generates the following query:

      SELECT * FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY "id" LIMIT 1024

      The query planner may decide to scan the `my_table_pkey` index (to satisfy the ORDER BY and LIMIT) even though scanning `index_my_table_updated_at` is optimal here. If the my_table_pkey index is very large (e.g. 100 GB), the query may take a long time to finish or even time out.
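      One way to check which plan the planner actually picks (a diagnostic sketch, assuming the table and indexes from the example above exist) is to run EXPLAIN on the generated chunk query:

      -- Diagnostic only: shows which index the planner chooses for the chunk query.
      EXPLAIN SELECT * FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY "id" LIMIT 1024;

      A plan driven by `my_table_pkey` with the BETWEEN predicate applied as a filter is the slow choice described above; a plan driven by `index_my_table_updated_at` followed by a sort is the fast one.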

      One possible solution is to calculate the min and max PK values that satisfy the additional condition and apply that PK range to every chunk, including the initial one.

      For example, the min and max PK can be calculated as follows:

       

      min_id = SELECT id FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY id ASC LIMIT 1;
      max_id = SELECT id FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147) ORDER BY id DESC LIMIT 1;
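      Both bounds could also be fetched in a single round trip (a sketch, assuming id is the NOT NULL primary key, so the NULL-ignoring semantics of MIN/MAX do not change the result):

      -- Hypothetical alternative: fetch both bounds in one query.
      SELECT MIN(id) AS min_id, MAX(id) AS max_id FROM "public"."my_table" WHERE updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147);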
       
      

      Initial chunk query:

      SELECT * FROM "public"."my_table" WHERE (id >= min_id AND id <= max_id) AND (updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147)) ORDER BY "id" LIMIT 1024

      Subsequent chunks can then be queried as follows (assuming last_id is the id of the last row written in the previous chunk):

       

      SELECT * FROM "public"."my_table" WHERE (id > last_id AND id <= max_id) AND (updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147)) ORDER BY "id" LIMIT 1024

       

      This should perform well whether the planner chooses the my_table_pkey index or index_my_table_updated_at. It will help use cases where we want to backfill recently updated rows.
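      The same EXPLAIN check can be repeated on the bounded query to verify the improvement (a sketch; 1000000 and 1042000 are hypothetical stand-ins for the computed min_id/max_id):

      -- Hypothetical bounds substituted for min_id/max_id.
      EXPLAIN SELECT * FROM "public"."my_table" WHERE (id >= 1000000 AND id <= 1042000) AND (updated_at BETWEEN to_timestamp(1669905612.647841) AND to_timestamp(1669908892.4985147)) ORDER BY "id" LIMIT 1024;

      With the PK range in place, even a scan of my_table_pkey is bounded to [min_id, max_id] rather than walking the entire 100 GB index.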

      Implementation ideas (optional)

      Calculate the min and max key at the start of the incremental snapshot and use that min/max range for all the chunks.

            Assignee: Unassigned
            Reporter: Praveen Burgu (prburgu) (Inactive)
            Votes: 0
            Watchers: 2
