-
Bug
-
Resolution: Done
-
Major
-
2.5.0.Final
-
None
-
False
-
None
-
False
Bug report
This is a continuation of https://issues.redhat.com/browse/DBZ-6635. Debezium still fails to issue heartbeat query action statements when the database is starting under some conditions:
- There must be a pre-existing replication slot.
- There must be a stored LSN offset for the connector in Connect that Debezium will want to resume from.
- The replication slot has nothing to read from it. This is indicated by:
- Stop Kafka Connect / stop the connector completely.
- Connecting to the database with psql and run select pg_logical_slot_peek_binary_changes('my_slot_name', null, 10, 'proto_version', '1', 'publication_names', 'my_publication_name');
- Confirm that there are no messages returned.
Under these conditions, the connector will cause unconstrained WAL growth when there are other databases advancing the WAL (a big reason why heartbeat.action.query exists).
What Debezium connector do you use and what version?
Debezium PostgreSQL 2.5.0.Final
Steps to reproduce
- Create a heartbeat table in a test database that otherwise has absolutely zero traffic to it. This should do:
CREATE SCHEMA framework; CREATE TABLE framework.dbz_heartbeat ( heartbeat_id text NOT NULL PRIMARY KEY, heartbeat_ts timestamp with time zone NOT NULL );
- Now, create a connector for the database with a configuration similar to this one. The crux of this configuration is that heartbeat interval is set to 60000 (1 minute) and a heartbeat action query is created to UPSERT into the above heartbeat table.
# General connector setup name=DebeziumWalGrowth connector.class=io.debezium.connector.postgresql.PostgresConnector tasks.max=1 # Basic database connection information database.hostname=10.10.100.1 database.port=5432 database.user=cdc_replication database.dbname=featureflags # Logical replication configuration publication.name=cdc_publication publication.autocreate.mode=false slot.name=debezium_wal_growth plugin.name=pgoutput # Disable snapshots on initialization snapshot.mode=never table.include.list=public.ffl_feature_flags topic.prefix=debezium-wal-growth # Heartbeat heartbeat.interval.ms=60000 heartbeat.action.query=INSERT INTO framework.dbz_heartbeat (heartbeat_id, heartbeat_ts) VALUES ('main_cdc', current_timestamp) ON CONFLICT (heartbeat_id) DO UPDATE SET heartbeat_ts = EXCLUDED.heartbeat_ts; key.converter=io.confluent.connect.avro.AvroConverter key.converter.schema.registry.url=http://10.10.100.1:17731 value.converter=io.confluent.connect.avro.AvroConverter value.converter.schema.registry.url=http://10.10.100.1:17731
- Constantly poll the heartbeat table from the psql CLI tool. Just do it by hand, and run the query every second or two.
SELECT * FROM framework.dbz_heartbeat;
As soon as you see the timestamp updated, indicating the heartbeat action was just now run a moment ago, wait a few seconds, and then gracefully stop Kafka Connect.
- Now that Kafka Connect is stopped, let's confirm the status of your replication slot. There should be zero pending messages in it. If there's stuff still showing up there, then this bug probably won't reproduce.
SELECT pg_logical_slot_peek_binary_changes('debezium_wal_growth', null, 10, 'proto_version', '1', 'publication_names', 'cdc_publication');
If there are rows shown, then try this to clear them out; repeat until there is nothing left:
SELECT pg_logical_slot_get_binary_changes('debezium_wal_growth', null, null, 'proto_version', '1', 'publication_names', 'cdc_publication');
- Restart Kafka Connect / the connector.
NOTE: While these steps to reproduce involves manually clearing out the slot, I have run into this naturally just from frequently restarting a local Connect instance during unrelated development.
What is the captured database version and mode of depoyment?
PostgreSQL 13 running locally in docker
What behaviour do you expect?
Under the circumstances established above, I still expect that Debezium will continue to issue heartbeat query actions. These should in turn result in the new LSN offsets being committed to PostgreSQL and to Kafka Connect as normal, and thus prevent unconstrained WAL growth when there are other high traffic databases mixed on the same cluster with a zero traffic database.
What behaviour do you see?
The connector gets stuck on startup, with this being the last message logged:
[2024-01-05 09:39:28,018] INFO [DebeziumWalGrowth|task-0] Searching for WAL resume position (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource:341)
No records are written to the heartbeat table. The WAL therefore grows unbounded if other databases experience traffic, because this replication slot is not advancing.
Sometimes I have subsequently seen an error like this one after the above message is logged (from a different connector):
INFO [PostgresMainCDCSource-v001-targeting|task-0|offsets] Received offset commit request on 'LSN{1/54E256C0}', but ignoring it. LSN flushing is not allowed yet (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource:426)
Do you see the same behaviour using the latest released Debezium version?
Yes, version 2.5.0.Final
Preliminary investigation
The issue reported in https://issues.redhat.com/browse/DBZ-6635 was very similar, and resulted in this pull request: https://github.com/debezium/debezium/pull/4779. This resulted in removing a conditional around the dispatchHeartbeatEvent function call in the main message processing loop. This means the main message processing loop is guaranteed to be running the heartbeat action regularly.
However, there is another message processing loop that runs on startup: the searchWalPosition loop: https://github.com/debezium/debezium/blob/v2.5.0.Final/debezium-connector-postgres/src/main/java/io/debezium/connector/postgresql/PostgresStreamingChangeEventSource.java#L336. This loop appears to loop forever until a message is received. And that's evidenced in the logs cited above, where the WAL resume search is never ended: the connector gets stuck in this loop.
And.... unlike the main message loop, that searchWalPosition loop does not have a heartbeat call. At all. (Should one be added?) So naturally, the WAL continues to grow since there is no heartbeat action query being run.
- links to
-
RHEA-2024:129636 Red Hat build of Debezium 2.5.4 release