Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 2.5.1.Final, 2.6.0.Alpha1
Affects Version/s: 2.5.0.Final
Component/s: postgresql-connector
Labels:
None

Blocked:
False
Blocked Reason:
None
Ready:
False
Git Pull Request:
https://github.com/debezium/debezium/pull/5163

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

Bug report

This is a continuation of https://issues.redhat.com/browse/DBZ-6635. Debezium still fails to issue heartbeat query action statements when the database is starting under some conditions:

There must be a pre-existing replication slot.
There must be a stored LSN offset for the connector in Connect that Debezium will want to resume from.
The replication slot has nothing to read from it. This is indicated by:
- Stop Kafka Connect / stop the connector completely.
- Connecting to the database with psql and run select pg_logical_slot_peek_binary_changes('my_slot_name', null, 10, 'proto_version', '1', 'publication_names', 'my_publication_name');
- Confirm that there are no messages returned.

Under these conditions, the connector will cause unconstrained WAL growth when there are other databases advancing the WAL (a big reason why heartbeat.action.query exists).

What Debezium connector do you use and what version?

Debezium PostgreSQL 2.5.0.Final

Steps to reproduce

Create a heartbeat table in a test database that otherwise has absolutely zero traffic to it. This should do:

CREATE SCHEMA framework;
CREATE TABLE framework.dbz_heartbeat (
    heartbeat_id text NOT NULL PRIMARY KEY,
    heartbeat_ts timestamp with time zone NOT NULL
);

Now, create a connector for the database with a configuration similar to this one. The crux of this configuration is that heartbeat interval is set to 60000 (1 minute) and a heartbeat action query is created to UPSERT into the above heartbeat table.

# General connector setup
name=DebeziumWalGrowth
connector.class=io.debezium.connector.postgresql.PostgresConnector
tasks.max=1

# Basic database connection information
database.hostname=10.10.100.1
database.port=5432
database.user=cdc_replication
database.dbname=featureflags

# Logical replication configuration
publication.name=cdc_publication
publication.autocreate.mode=false
slot.name=debezium_wal_growth
plugin.name=pgoutput

# Disable snapshots on initialization
snapshot.mode=never
table.include.list=public.ffl_feature_flags
topic.prefix=debezium-wal-growth

# Heartbeat
heartbeat.interval.ms=60000
heartbeat.action.query=INSERT INTO framework.dbz_heartbeat (heartbeat_id, heartbeat_ts) VALUES ('main_cdc', current_timestamp) ON CONFLICT (heartbeat_id) DO UPDATE SET heartbeat_ts = EXCLUDED.heartbeat_ts;

key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://10.10.100.1:17731
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://10.10.100.1:17731

Constantly poll the heartbeat table from the psql CLI tool. Just do it by hand, and run the query every second or two.

SELECT * FROM framework.dbz_heartbeat;

As soon as you see the timestamp updated, indicating the heartbeat action was just now run a moment ago, wait a few seconds, and then gracefully stop Kafka Connect.

Now that Kafka Connect is stopped, let's confirm the status of your replication slot. There should be zero pending messages in it. If there's stuff still showing up there, then this bug probably won't reproduce.

SELECT pg_logical_slot_peek_binary_changes('debezium_wal_growth', null, 10, 'proto_version', '1', 'publication_names', 'cdc_publication');

If there are rows shown, then try this to clear them out; repeat until there is nothing left:

SELECT pg_logical_slot_get_binary_changes('debezium_wal_growth', null, null, 'proto_version', '1', 'publication_names', 'cdc_publication');

Restart Kafka Connect / the connector.

NOTE: While these steps to reproduce involves manually clearing out the slot, I have run into this naturally just from frequently restarting a local Connect instance during unrelated development.

What is the captured database version and mode of depoyment?

PostgreSQL 13 running locally in docker

What behaviour do you expect?

Under the circumstances established above, I still expect that Debezium will continue to issue heartbeat query actions. These should in turn result in the new LSN offsets being committed to PostgreSQL and to Kafka Connect as normal, and thus prevent unconstrained WAL growth when there are other high traffic databases mixed on the same cluster with a zero traffic database.

What behaviour do you see?

The connector gets stuck on startup, with this being the last message logged:

[2024-01-05 09:39:28,018] INFO [DebeziumWalGrowth|task-0] Searching for WAL resume position (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource:341)

No records are written to the heartbeat table. The WAL therefore grows unbounded if other databases experience traffic, because this replication slot is not advancing.

Sometimes I have subsequently seen an error like this one after the above message is logged (from a different connector):

INFO [PostgresMainCDCSource-v001-targeting|task-0|offsets] Received offset commit request on 'LSN{1/54E256C0}', but ignoring it. LSN flushing is not allowed yet (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource:426)

Do you see the same behaviour using the latest released Debezium version?

Yes, version 2.5.0.Final

Preliminary investigation

The issue reported in https://issues.redhat.com/browse/DBZ-6635 was very similar, and resulted in this pull request: https://github.com/debezium/debezium/pull/4779. This resulted in removing a conditional around the dispatchHeartbeatEvent function call in the main message processing loop. This means the main message processing loop is guaranteed to be running the heartbeat action regularly.

However, there is another message processing loop that runs on startup: the searchWalPosition loop: https://github.com/debezium/debezium/blob/v2.5.0.Final/debezium-connector-postgres/src/main/java/io/debezium/connector/postgresql/PostgresStreamingChangeEventSource.java#L336. This loop appears to loop forever until a message is received. And that's evidenced in the logs cited above, where the WAL resume search is never ended: the connector gets stuck in this loop.

And.... unlike the main message loop, that searchWalPosition loop does not have a heartbeat call. At all. (Should one be added?) So naturally, the WAL continues to grow since there is no heartbeat action query being run.

links to

RHEA-2024:129636 Red Hat build of Debezium 2.5.4 release

Details

Description

Bug report

What Debezium connector do you use and what version?

Steps to reproduce

What is the captured database version and mode of depoyment?

What behaviour do you expect?

What behaviour do you see?

Do you see the same behaviour using the latest released Debezium version?

Preliminary investigation

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates