Uploaded image for project: 'Debezium'
  1. Debezium
  2. DBZ-7273

Provide alternative direct query for faster execution

XMLWordPrintable

    • False
    • None
    • False

      Feature request or enhancement

      Debezium SQL Server connector uses cdc.fn_cdc_get_all_changes_... stored procedure to get cdc data. That procedure executes a union all of 3 selects only one of which returns rows at any given execution based on the arguments:

      create function [cdc].[fn_cdc_get_all_changes_dbo_t1]
          (   @from_lsn binary(10),
              @to_lsn binary(10),
              @row_filter_option nvarchar(30)
          )
          returns table
          return
      
          select NULL as __$start_lsn,
              NULL as __$seqval,
              NULL as __$operation,
              NULL as __$update_mask, NULL as [id], NULL as [name]
          where ( [sys].[fn_cdc_check_parameters]( N'dbo_t1', @from_lsn, @to_lsn, lower(rtrim(ltrim(@row_filter_option))), 0) = 0)
      
          union all
      
          select t.__$start_lsn as __$start_lsn,
              t.__$seqval as __$seqval,
              t.__$operation as __$operation,
              t.__$update_mask as __$update_mask, t.[id], t.[name]
          from [cdc].[dbo_t1_CT] t with (nolock)
          where (lower(rtrim(ltrim(@row_filter_option))) = 'all')
              and ( [sys].[fn_cdc_check_parameters]( N'dbo_t1', @from_lsn, @to_lsn, lower(rtrim(ltrim(@row_filter_option))), 0) = 1)
              and (t.__$operation = 1 or t.__$operation = 2 or t.__$operation = 4)
              and (t.__$start_lsn <= @to_lsn)
              and (t.__$start_lsn >= @from_lsn)
      
          union all
      
          select t.__$start_lsn as __$start_lsn,
              t.__$seqval as __$seqval,
              t.__$operation as __$operation,
              t.__$update_mask as __$update_mask, t.[id], t.[name]
          from [cdc].[dbo_t1_CT] t with (nolock)
          where (lower(rtrim(ltrim(@row_filter_option))) = 'all update old')
              and ( [sys].[fn_cdc_check_parameters]( N'dbo_t1', @from_lsn, @to_lsn, lower(rtrim(ltrim(@row_filter_option))), 0) = 1)
              and (t.__$operation = 1 or t.__$operation = 2 or t.__$operation = 4 or
                   t.__$operation = 3 )
              and (t.__$start_lsn <= @to_lsn)
              and (t.__$start_lsn >= @from_lsn)
       

      The first select returns a row only when fn_cdc_check_parameters returns 0, the second one - only when fn_cdc_check_parameters returns 1 and @row_filter_option == 'all', the third one - only when fn_cdc_check_parameters returns 1 and @row_filter_option == 'all update old'.

      Because of those union all the database has to create a temporary table. Since debezium asks the result set to be ordered and there are no indices on the temp table to help the database has to sort the whole result set by itself which is very slow and resource intensive.

      In order to confirm that it's union all does not allow the database to use indices to order the result set I've created the following stored procedure with first two select statements from the original stored prcedure removed:

      create function [cdc].[fn_cdc_get_all_changes_dbo_t1]
          (   @from_lsn binary(10),
              @to_lsn binary(10),
              @row_filter_option nvarchar(30)
          )
          returns table
          return
          select t.__$start_lsn as __$start_lsn,
              t.__$seqval as __$seqval,
              t.__$operation as __$operation,
              t.__$update_mask as __$update_mask, t.[id], t.[name]
          from [cdc].[dbo_t1_CT] t with (nolock)
          where (lower(rtrim(ltrim(@row_filter_option))) = 'all update old')
              and ( [sys].[fn_cdc_check_parameters]( N'dbo_t1', @from_lsn, @to_lsn, lower(rtrim(ltrim(@row_filter_option))), 0) = 1)
              and (t.__$operation = 1 or t.__$operation = 2 or t.__$operation = 4 or
                   t.__$operation = 3 )
              and (t.__$start_lsn <= @to_lsn)
              and (t.__$start_lsn >= @from_lsn)
      

      Additionally I've defined an index to help the query to sort the result set:

      create index index1 on cdc.dbo_t1_CT (__$start_lsn ASC, __$seqval ASC, __$operation ASC)
      

      The execution time drops from 30 seconds to about 400 ms on a table containing ~600K rows. Querying [cdc].[dbo_t1_CT] capture instance table directly yielded similar results (400ms). This confirms that union all is the bottleneck.

      Which use case/requirement will be addressed by the proposed feature?

      We regularly get huge transactions to consume that contain hundreds of thousands or even millions of rows. It often requires manual intervention to adjust `max.iteration.transactions` so that the result set is not too big and it doesn't put too much pressure on the database server. Implementing this enhancement should ease the burden put on the database during those times.

      Implementation ideas (optional)

      Instead of executing the stored procedure debezium could query the capture instance table directly. This would allow the database to use indices (if there are any) to order the result set. The query could look something like this:

      select *
      from cdc.dbo_t1_CT t
      WHERE t.[__$start_lsn] >= CONVERT(binary, '0x00098eb50000a7840005',1)
      AND t.[__$start_lsn] <= CONVERT(binary,'0x00098eb7000229ea0001',1)
      order by t.[__$start_lsn] ASC, t.[__$seqval] ASC, t.[__$operation] ASC
      

      Users who need better performance could create the index by themselves.

              Unassigned Unassigned
              ramanenka Vadzim Ramanenka (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Created:
                Updated:
                Resolved: