Uploaded image for project: 'Debezium'
  1. Debezium
  2. DBZ-7316

Debezium heartbeat.action.query does not start before writing to WAL: part 2

XMLWordPrintable

      Bug report

      This is a continuation of https://issues.redhat.com/browse/DBZ-6635.  Debezium still fails to issue heartbeat query action statements when the database is starting under some conditions:

      • There must be a pre-existing replication slot.
      • There must be a stored LSN offset for the connector in Connect that Debezium will want to resume from.
      • The replication slot has nothing to read from it.  This is indicated by:
        • Stop Kafka Connect / stop the connector completely.
        • Connecting to the database with psql and run select pg_logical_slot_peek_binary_changes('my_slot_name', null, 10, 'proto_version', '1', 'publication_names', 'my_publication_name');
        • Confirm that there are no messages returned.

      Under these conditions, the connector will cause unconstrained WAL growth when there are other databases advancing the WAL (a big reason why heartbeat.action.query exists).

      What Debezium connector do you use and what version?

      Debezium PostgreSQL 2.5.0.Final

      Steps to reproduce

      1. Create a heartbeat table in a test database that otherwise has absolutely zero traffic to it.  This should do:
      CREATE SCHEMA framework;
      CREATE TABLE framework.dbz_heartbeat (
          heartbeat_id text NOT NULL PRIMARY KEY,
          heartbeat_ts timestamp with time zone NOT NULL
      ); 
      1. Now, create a connector for the database with a configuration similar to this one.  The crux of this configuration is that heartbeat interval is set to 60000 (1 minute) and a heartbeat action query is created to UPSERT into the above heartbeat table.
      # General connector setup
      name=DebeziumWalGrowth
      connector.class=io.debezium.connector.postgresql.PostgresConnector
      tasks.max=1
      
      # Basic database connection information
      database.hostname=10.10.100.1
      database.port=5432
      database.user=cdc_replication
      database.dbname=featureflags
      
      # Logical replication configuration
      publication.name=cdc_publication
      publication.autocreate.mode=false
      slot.name=debezium_wal_growth
      plugin.name=pgoutput
      
      # Disable snapshots on initialization
      snapshot.mode=never
      table.include.list=public.ffl_feature_flags
      topic.prefix=debezium-wal-growth
      
      # Heartbeat
      heartbeat.interval.ms=60000
      heartbeat.action.query=INSERT INTO framework.dbz_heartbeat (heartbeat_id, heartbeat_ts) VALUES ('main_cdc', current_timestamp) ON CONFLICT (heartbeat_id) DO UPDATE SET heartbeat_ts = EXCLUDED.heartbeat_ts;
      
      key.converter=io.confluent.connect.avro.AvroConverter
      key.converter.schema.registry.url=http://10.10.100.1:17731
      value.converter=io.confluent.connect.avro.AvroConverter
      value.converter.schema.registry.url=http://10.10.100.1:17731 
      1. Constantly poll the heartbeat table from the psql CLI tool.  Just do it by hand, and run the query every second or two.
      SELECT * FROM framework.dbz_heartbeat;
      

      As soon as you see the timestamp updated, indicating the heartbeat action was just now run a moment ago, wait a few seconds, and then gracefully stop Kafka Connect.

      1. Now that Kafka Connect is stopped, let's confirm the status of your replication slot.  There should be zero pending messages in it.  If there's stuff still showing up there, then this bug probably won't reproduce.
      SELECT pg_logical_slot_peek_binary_changes('debezium_wal_growth', null, 10, 'proto_version', '1', 'publication_names', 'cdc_publication');

      If there are rows shown, then try this to clear them out; repeat until there is nothing left:

      SELECT pg_logical_slot_get_binary_changes('debezium_wal_growth', null, null, 'proto_version', '1', 'publication_names', 'cdc_publication'); 
      1. Restart Kafka Connect / the connector.

      NOTE: While these steps to reproduce involves manually clearing out the slot, I have run into this naturally just from frequently restarting a local Connect instance during unrelated development.

      What is the captured database version and mode of depoyment?

      PostgreSQL 13 running locally in docker

      What behaviour do you expect?

      Under the circumstances established above, I still expect that Debezium will continue to issue heartbeat query actions.  These should in turn result in the new LSN offsets being committed to PostgreSQL and to Kafka Connect as normal, and thus prevent unconstrained WAL growth when there are other high traffic databases mixed on the same cluster with a zero traffic database.

      What behaviour do you see?

      The connector gets stuck on startup, with this being the last message logged:

       

      [2024-01-05 09:39:28,018] INFO [DebeziumWalGrowth|task-0] Searching for WAL resume position (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource:341) 

      No records are written to the heartbeat table.  The WAL therefore grows unbounded if other databases experience traffic, because this replication slot is not advancing.

      Sometimes I have subsequently seen an error like this one after the above message is logged (from a different connector):

      INFO [PostgresMainCDCSource-v001-targeting|task-0|offsets] Received offset commit request on 'LSN{1/54E256C0}', but ignoring it. LSN flushing is not allowed yet (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource:426)

       

      Do you see the same behaviour using the latest released Debezium version?

      Yes, version 2.5.0.Final

      Preliminary investigation

      The issue reported in https://issues.redhat.com/browse/DBZ-6635 was very similar, and resulted in this pull request: https://github.com/debezium/debezium/pull/4779.  This resulted in removing a conditional around the dispatchHeartbeatEvent function call in the main message processing loop.  This means the main message processing loop is guaranteed to be running the heartbeat action regularly.

      However, there is another message processing loop that runs on startup: the searchWalPosition loop:  https://github.com/debezium/debezium/blob/v2.5.0.Final/debezium-connector-postgres/src/main/java/io/debezium/connector/postgresql/PostgresStreamingChangeEventSource.java#L336.  This loop appears to loop forever until a message is received.  And that's evidenced in the logs cited above, where the WAL resume search is never ended: the connector gets stuck in this loop.

      And.... unlike the main message loop, that searchWalPosition loop does not have a heartbeat call.  At all.  (Should one be added?)  So naturally, the WAL continues to grow since there is no heartbeat action query being run.

              Unassigned Unassigned
              james-johnston-thumbtack James Johnston (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Created:
                Updated:
                Resolved: