Type: Bug
Resolution: Done
Priority: Major
Fix Version/s: 2.3.0.Beta1
Affects Version/s: 2.2.0.Final
Component/s: debezium-server
Labels:
None

Blocked:
False
Blocked Reason:
None
Ready:
False
Git Pull Request:
https://github.com/debezium/debezium-server/pull/23

Severity:
Important

Bug report

When the load on Pub/Sub gets too high (roughly 2.5k+ messages per second), there is a high chance that the connection fails. Debezium Server does not appear to recover from such failures and halts. This happens especially during initial snapshots and makes snapshotting practically unusable. The errors I have seen under high Pub/Sub load are `INBOUND GO_AWAY` and `INBOUND RST_STREAM`; when either occurs, Debezium halts. To my understanding, the Pub/Sub library has no reconnect or retry mechanism for these cases, so this would require custom retry or reconnect code in the Debezium Server Pub/Sub sink.

To reproduce the issue, you need a database with millions of rows, plus Debezium Server and Pub/Sub. With snapshot mode `initial`, you should see publishing to Pub/Sub halt randomly under high load.

What Debezium connector do you use and what version?

Debezium Server: 2.2

What is the connector configuration?

debezium.sink.type=pubsub
debezium.sink.pubsub.project.id={{ .Values.project_id }}
debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=0
debezium.source.database.hostname=localhost
debezium.source.database.port=5432
debezium.source.database.user={{ .Values.database_user }}
debezium.source.database.dbname={{ .Values.database_name }}
debezium.source.plugin.name=pgoutput
debezium.source.table.include.list=<several tables>
debezium.source.database.server.name={{ .Values.sql_instance_name }}
debezium.source.snapshot.mode=initial
debezium.source.tombstones.on.delete=false
# prefix is required but not used
debezium.source.topic.prefix=none
debezium.transforms=Reroute
debezium.transforms.Reroute.type=io.debezium.transforms.ByLogicalTableRouter
debezium.transforms.Reroute.topic.regex=(.*)public(.*)
debezium.transforms.Reroute.topic.replacement={{ .Values.pubsub_topic_name }}
# enable logging for debugging:
# quarkus.log.console.json=true
quarkus.log.level=TRACE

What is the captured database version and mode of deployment?

Postgres 14.5, managed on Google Cloud

What behaviour do you expect?

I'm expecting two things:

1. Debezium should be resilient to errors like these and able to reconnect after an error.
2. There should be a way to throttle throughput on the Debezium side, at least for snapshots.

The first option would be preferable, but the second could be good enough as well.
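
The second expectation, throttling, could look roughly like the sketch below. It is not taken from Debezium Server: the class name `SnapshotThrottle` and its methods are hypothetical, and it assumes the sink can call `acquire()` once per event before publishing. The idea is only to space events evenly so the publish rate stays below a configured limit.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not part of Debezium): paces callers to a fixed rate.
final class SnapshotThrottle {
    private final long intervalNanos; // minimum spacing between two events
    private long nextFreeNanos;       // earliest time the next event may be sent

    public SnapshotThrottle(double eventsPerSecond) {
        if (eventsPerSecond <= 0) {
            throw new IllegalArgumentException("eventsPerSecond must be > 0");
        }
        this.intervalNanos = (long) (1_000_000_000L / eventsPerSecond);
        this.nextFreeNanos = System.nanoTime();
    }

    /** Blocks until the next event may be sent and returns the nanoseconds waited. */
    public synchronized long acquire() throws InterruptedException {
        long now = System.nanoTime();
        long waitNanos = Math.max(0L, nextFreeNanos - now);
        // Reserve the next slot; never schedule in the past after an idle period.
        nextFreeNanos = Math.max(now, nextFreeNanos) + intervalNanos;
        if (waitNanos > 0) {
            // Sleeping while holding the lock keeps callers strictly ordered,
            // which is acceptable for a single-threaded publishing loop.
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
        return waitNanos;
    }
}
```

With `new SnapshotThrottle(2000.0)`, a publishing loop that calls `acquire()` before each message would stay at or below about 2,000 events per second, under the load at which the errors were observed.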

What behaviour do you see?

When Pub/Sub fails under heavy load, Debezium Server halts and does not recover. This happens easily during snapshotting.

Do you see the same behaviour using the latest released Debezium version?

I have tested on Debezium Server 2.2, and to my understanding there is no code in the upcoming versions that would solve this.

Do you have the connector logs, ideally from start till finish?

Here are some logs and further discussion about the topic: https://debezium.zulipchat.com/#narrow/stream/350571-community-dbz-server/topic/An.20official.20helm.20chart/near/356748927

How to reproduce the issue using our tutorial deployment?

To reproduce the issue, you need a database with millions of rows (perhaps 20 million spread across several tables), plus Debezium Server and Pub/Sub. With snapshot mode `initial`, you should see publishing to Pub/Sub halt randomly under high load. The database and Debezium Server may both need to be deployed in Google Cloud on machines large enough to sustain high throughput (2.5k-3k+ events per second). In my testing, this eventually halts Debezium because of Pub/Sub gRPC errors that are not handled gracefully.

Feature request or enhancement

For feature requests or enhancements, provide this information, please:

Which use case/requirement will be addressed by the proposed feature?

Snapshots are currently not usable with Debezium and Pub/Sub. Fixing this would make it possible to generate snapshots with Debezium, so that you would not need to rely on custom scripts for backfilling.

Implementation ideas (optional)

Preferred option: build retry or reconnect logic to prevent Debezium from halting
Alternative: add an option to throttle snapshot throughput
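
As a rough illustration of the preferred option, the sketch below retries a failing publish call with exponential backoff instead of halting. It is a minimal sketch, not the Debezium Server implementation: `RetryWithBackoff` and `callWithRetry` are hypothetical names, the publish call is represented by a generic `Callable`, and every exception is treated as retryable, whereas a real sink would probably retry only transient gRPC errors such as the `GO_AWAY` and `RST_STREAM` failures described above.

```java
import java.util.concurrent.Callable;

// Hypothetical helper (not part of Debezium): retry with exponential backoff.
final class RetryWithBackoff {
    private RetryWithBackoff() {
    }

    /**
     * Runs the action up to maxAttempts times. After each failure it sleeps,
     * doubling the delay up to maxDelayMs, and rethrows the last failure
     * once all attempts are used up.
     */
    public static <T> T callWithRetry(Callable<T> action, int maxAttempts,
                                      long initialDelayMs, long maxDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread is being shut down.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                // Assumption: every other failure is considered transient.
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
        throw last;
    }
}
```

A sink could wrap its publish step as `RetryWithBackoff.callWithRetry(() -> publishBatch(records), 5, 500, 10_000)`, where `publishBatch` is a placeholder for whatever sends the batch and, if needed, recreates the publisher connection first.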

is related to

DBZ-5175 Debezium Server stops sending events to Google Cloud Pub/Sub

Closed

links to

RHEA-2023:120698 Red Hat build of Debezium 2.3.4 release
