-
Bug
-
Resolution: Done
-
Major
-
2.2.0.Final
-
None
Bug report
When the load on Pub/Sub gets too high (roughly 2.5k+ messages per second), there is a high chance that the connection fails. Debezium Server does not appear to recover from such failures and halts. This happens especially while taking initial snapshots and renders snapshotting practically unusable. The errors I have seen under high Pub/Sub load are `INBOUND GO_AWAY` and `INBOUND RST_STREAM`. When either of these occurs, Debezium halts. To my understanding, the Pub/Sub library has no reconnect or retry mechanism for these cases, so this would require custom retry or reconnect code in the Debezium Server Pub/Sub sink.
To reproduce the issue, you need a database with millions of rows. You also need Debezium Server and Pub/Sub. With snapshot mode `initial`, you should see publishing to Pub/Sub halt randomly under high load.
What Debezium connector do you use and what version?
Debezium Server: 2.2
What is the connector configuration?
debezium.sink.type=pubsub
debezium.sink.pubsub.project.id={{ .Values.project_id }}
debezium.source.connector.class=io.debezium.connector.postgresql.PostgresConnector
debezium.source.offset.storage.file.filename=data/offsets.dat
debezium.source.offset.flush.interval.ms=0
debezium.source.database.hostname=localhost
debezium.source.database.port=5432
debezium.source.database.user={{ .Values.database_user }}
debezium.source.database.dbname={{ .Values.database_name }}
debezium.source.plugin.name=pgoutput
debezium.source.table.include.list=<several tables>
debezium.source.database.server.name={{ .Values.sql_instance_name }}
debezium.source.snapshot.mode=initial
debezium.source.tombstones.on.delete=false
# prefix is required but not used
debezium.source.topic.prefix=none
debezium.transforms=Reroute
debezium.transforms.Reroute.type=io.debezium.transforms.ByLogicalTableRouter
debezium.transforms.Reroute.topic.regex=(.*)public(.*)
debezium.transforms.Reroute.topic.replacement={{ .Values.pubsub_topic_name }}
# enable logging for debugging:
# quarkus.log.console.json=true
quarkus.log.level=TRACE
What is the captured database version and mode of deployment?
Postgres 14.5, managed on Google Cloud
What behaviour do you expect?
I'm expecting two things:
- Debezium should be resilient to errors like these and should be able to reconnect after an error.
- There should be a way to throttle the throughput on the Debezium side, at least for snapshots.
The first option would be preferable, but the second could be good enough as well.
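To illustrate the first expectation, here is a minimal sketch of a retry loop with exponential backoff. It is not taken from Debezium or the Pub/Sub client library: the class `RetryWithBackoff`, its `call` method, and all parameters are hypothetical names, and the failing publish is simulated with a plain lambda rather than a real Pub/Sub call.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Debezium or the Pub/Sub client library.
public class RetryWithBackoff {

    /**
     * Runs the action until it succeeds or maxAttempts is reached.
     * The delay between attempts doubles each time, capped at maxDelayMs.
     */
    public static <T> T call(Callable<T> action, int maxAttempts,
                             long initialDelayMs, long maxDelayMs) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                Thread.sleep(delayMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated publish that fails twice before succeeding.
        String result = call(() -> {
            if (++calls[0] < 3) {
                throw new IllegalStateException("simulated transient failure");
            }
            return "published";
        }, 5, 10, 100);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

In a real sink, the retried action would presumably also need to rebuild the publisher or channel after a `GO_AWAY`, which this sketch does not attempt.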
What behaviour do you see?
When PubSub fails under heavy load, Debezium Server halts and does not recover from that. It happens easily during snapshotting.
Do you see the same behaviour using the latest released Debezium version?
I have tested on Debezium Server 2.2 and, to my understanding, there is no code in the upcoming versions that would solve this.
Do you have the connector logs, ideally from start till finish?
Here are some logs and further discussion about the topic: https://debezium.zulipchat.com/#narrow/stream/350571-community-dbz-server/topic/An.20official.20helm.20chart/near/356748927
How to reproduce the issue using our tutorial deployment?
To reproduce the issue, you need a database with millions of rows (maybe 20 million spread across multiple tables?). You also need Debezium Server and Pub/Sub. With snapshot mode `initial`, you should see publishing to Pub/Sub halt randomly under high load. The database and Debezium Server might both need to be deployed in Google Cloud on machines large enough to ensure sufficiently high throughput (2.5k-3k+ events per second). In my testing, this eventually halts Debezium due to Pub/Sub gRPC errors that are not handled gracefully.
Feature request or enhancement
For feature requests or enhancements, provide this information, please:
Which use case/requirement will be addressed by the proposed feature?
You currently cannot do snapshots with Debezium and Pub/Sub. Fixing this would make it possible to generate snapshots using Debezium, so that you would not need to rely on custom scripts for backfilling.
Implementation ideas (optional)
Preferable option: Build retry or reconnect logic to prevent Debezium from halting
Alternative: Option to throttle the snapshot throughput
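The throttling alternative could look roughly like the following sketch, which spaces events evenly to stay under a target rate. The class `SimpleThrottle` and its `acquire` method are hypothetical and not an existing Debezium option; the rate of 2000 events per second in the example is only an assumed value below the roughly 2.5k per second at which the failures were observed.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical pacing limiter, not an existing Debezium feature.
public class SimpleThrottle {
    private final long intervalNanos;
    private long nextAllowedNanos;

    public SimpleThrottle(double eventsPerSecond) {
        if (eventsPerSecond <= 0) {
            throw new IllegalArgumentException("eventsPerSecond must be > 0");
        }
        this.intervalNanos = (long) (1_000_000_000L / eventsPerSecond);
        this.nextAllowedNanos = System.nanoTime();
    }

    /** Blocks until the next event may be sent, spacing events evenly. */
    public synchronized void acquire() throws InterruptedException {
        long now = System.nanoTime();
        long waitNanos = nextAllowedNanos - now;
        if (waitNanos > 0) {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
        // Schedule the next slot one interval after this one.
        nextAllowedNanos = Math.max(now, nextAllowedNanos) + intervalNanos;
    }

    public static void main(String[] args) throws InterruptedException {
        SimpleThrottle throttle = new SimpleThrottle(2000.0); // assumed target rate
        long start = System.nanoTime();
        for (int i = 0; i < 200; i++) {
            throttle.acquire();
            // A real sink would publish one event here.
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("200 events took about " + elapsedMs + " ms");
    }
}
```

Even pacing is the simplest choice; a token bucket would allow short bursts while keeping the same average rate.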
- is related to
-
DBZ-5175 Debezium Server stops sending events to Google Cloud Pub/Sub
- Closed
- links to
-
RHEA-2023:120698 Red Hat build of Debezium 2.3.4 release