-
Bug
-
Resolution: Not a Bug
-
Critical
-
None
-
1.5.0.Final
-
None
-
False
-
False
-
Undefined
-
-
A short back story: with 1.1, we occasionally hit a bug where Dbz would not commit the LSN offset to Postgres before restarting, and afterwards would spam "LSN is 123, but last processed value is 456, skipping" messages. In 1.2 the spamming was reduced significantly, but processing speed was very poor.
The issue: yesterday we hit the "very slowly skipping over LSN" issue with 30 GB of data, and by my rough estimate it would have taken several days to process the backlog (while new data kept queuing up). In an attempt to solve it, I upgraded to 1.4.2, which skipped over the 30 GB of already-processed WAL very quickly, but as a result ALL connectors stopped updating both restart_lsn and confirmed_flush_lsn.
I've also tried 1.3.1 and 1.5.0, and both have the same problem.
The current "solution" is to restart Debezium regularly before the disk space runs out, as the WAL keeps piling up.
On restart, Dbz correctly finds newer LSN offsets in Kafka and updates the Postgres ones - but again, only on restart. One database, for example:
restart_lsn | 2AAF/CFF68660
confirmed_flush_lsn | 2AB0/26F97D68
while the current LSN is 2AB1/8C3FE1E0. There is no network bottleneck, Dbz shows almost zero CPU usage, and the events are all getting streamed.
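For scale, here is a quick sketch of the LSN arithmetic behind that lag (Postgres LSNs are two 32-bit hex halves of a 64-bit WAL position; the helper name is mine):

```python
def lsn_to_int(lsn: str) -> int:
    # A Postgres LSN like "2AB1/8C3FE1E0" is high/low 32-bit hex halves
    # of a 64-bit byte position in the WAL.
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

# restart_lsn of the slot vs. the server's current LSN from above:
lag_bytes = lsn_to_int("2AB1/8C3FE1E0") - lsn_to_int("2AAF/CFF68660")
print(f"{lag_bytes / 2**30:.1f} GiB behind")  # → 6.9 GiB behind
```

So the slot is retaining roughly 7 GiB of WAL between restarts, which is why the disk fills up.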
This affects all 12 connectors, all Postgres / decoderbufs.
Using: Postgres 12.x, Kafka 2.5.0 (CP 5.5.1), inter.broker.protocol.version = 2.5. Debezium is using official docker images with some custom extras (liveness probe).
Some databases use table include/exclude lists, but most of them stream almost everything, and there's plenty of traffic (from a handful to thousands of messages per second). A sample connector config:
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"errors.log.include.messages": "true",
"max.queue.size": "80000",
"include.schema.changes": "false",
"table.whitelist": "public.foo,public.bar,public.baz",
"errors.retries.delay.max.ms": "15000",
"decimal.handling.mode": "string",
"poll.interval.ms": "1250",
"errors.log.enable": "true",
"snapshot.fetch.size": "20000",
"database.tcpKeepAlive": "true",
"errors.retries.limit": "23040",
"heartbeat.interval.ms": "60000",
"plugin.name": "decoderbufs",
"schema.whitelist": "public",
"key.converter.schemas.enable": "false",
"value.converter.schemas.enable": "false",
"max.batch.size": "40000",
"snapshot.mode": "initial"
Where do I even start troubleshooting this? Is there a specific logger I could enable to log more about the Postgres-related internals? The default log level is currently INFO. I've created a secret Gist with the current log files, which I could message to someone on Gitter, but I don't see anything helpful in them myself. I'd be glad to try a beta build if needed.
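If there isn't a better-targeted logger, my guess is that bumping the connector's own package to TRACE would at least show the offset-flush activity. Assuming the stock Kafka Connect log4j.properties, something like:

```properties
# Assumption: io.debezium.connector.postgresql is the right logger name
# for the Postgres connector's LSN/offset handling.
log4j.logger.io.debezium.connector.postgresql=TRACE
```

Happy to run with that (or whatever logger is actually relevant) and update the Gist.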