-
Bug
-
Resolution: Cannot Reproduce
-
Major
-
None
-
1.9.4.Final
-
None
-
False
-
None
-
False
-
Bug report
We have several debezium connectors deployed in kubernetes. When a pod that is running a debezium task gets replaced it causes the connector to restart as the workload is stopped on the terminating pod and is started on the initializing pod.
Sometimes, but not all of the time, some of our connectors will be unable to start producing due to this issue. This is an example of one of our connectors failing to start - we've seen this issue occur with three different connectors.
In order to quickly remediate this issue and start producing again, we have been resetting the connector offset which allows debezium to unblock itself since it needs to recheck for the LSN instead of relying on the stored LSN. We have automated this process to perform the reset - https://debezium.io/documentation/faq/#how_to_remove_committed_offsets_for_a_connector
What Debezium connector do you use and what version?
{ class: io.debezium.connector.postgresql.PostgresConnector, type: source, version: 1.9.4.Final }
What is the connector configuration?
{ buffer.memory: 134217728, connector.class: io.debezium.connector.postgresql.PostgresConnector, database.dbname: <>, database.hostname: <>, database.password: <>, database.port: 5432, database.server.name: <>, database.user: <>, decimal.handling.mode: precise, errors.tolerance: all, hastore.handling.mode: json, heartbeat.interval.ms: 60000, key.converter.schemas.enable: false, max.request.size: 10485760, name: <>, plugin.name: pgoutput, predicates: has_heartbeat_prefix, predicates.has_heartbeat_prefix.pattern: __debezium-heartbeat.*, predicates.has_heartbeat_prefix.type: org.apache.kafka.connect.transforms.predicates.TopicNameMatches, provide.transaction.metadata: false, schema.refresh.mode: columns_diff_exclude_unchanged_toast, schema.whitelist: public, slot.name: <>, snapshot.mode: never, table.whitelist: public.<>.*, tombstones.on.delete: false, transforms: filter_heartbeat_messages,filter_op_type_transform,unwrap,reroute,service_transform, transforms.filter_heartbeat_messages.predicate: has_heartbeat_prefix, transforms.filter_heartbeat_messages.type: org.apache.kafka.connect.transforms.Filter, transforms.filter_op_type_transform.allowlist: c, transforms.filter_op_type_transform.type: com.service.kafka.connect.transform.FilterOPType$Value, transforms.service_transform.type: com.service.kafka.connect.transform.ServiceTransform$Value, transforms.reroute.key.enforce.uniqueness: false, transforms.reroute.topic.regex: (.*).(public).(.*), transforms.reroute.topic.replacement: production.cdc.service.initialized, transforms.reroute.type: io.debezium.transforms.ByLogicalTableRouter, transforms.unwrap.add.headers: db,table,op,lsn,source.ts_ms,ts_ms,source.txId,source.xmin, transforms.unwrap.add.headers.prefix: , transforms.unwrap.drop.tombstones: false, transforms.unwrap.type: io.debezium.transforms.ExtractNewRecordState, value.converter: org.apache.kafka.connect.json.JsonConverter, value.converter.schemas.enable: false }
What is the captured database version and mode of deployment?
AWS Aurora
=> select version(); version ------------------------------------------------------------------------------------------------- PostgreSQL 12.9 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (GCC) 7.4.0, 64-bit (1 row)=> select aurora_version(); aurora_version ---------------- 12.9.1 (1 row)=> select aurora_internal_version(); aurora_internal_version -------------------------------------------------------------------------------------------------------------- Aurora Postgres 4.0.3300.0 built from 303c0d0c41271f66bb14c4d06c6879dedf816334 by BUILD SYSTEM on 1644948026 (1 row)
What behaviour do you expect?
Production to start back up where it had stopped.
What behaviour do you see?
Production is unable to start due to:
io.debezium.DebeziumException: Message with LSN 'LSN{6E3/6B12F1F0}' larger than expected LSN 'LSN{6E3/6B12F040}'. This is unexpected and can lead to an infinite loop or a data loss.
Do you see the same behaviour using the latest released Debezium version?
We are unable to trigger this error and cannot reproduce it at will. We have only experienced this in our production debezium deployments which are running 1.9.4.Final.
Do you have the connector logs, ideally from start till finish?
(You might be asked later to provide DEBUG/TRACE level log)
Postgres Logs
2022-08-09 09:15:54 UTC:10.68.174.0(57936):[unknown]@[unknown]:[18341]:LOG: connection received: host=10.68.174.0 port=57936 2022-08-09 09:15:54 UTC:10.68.174.0(57936):service_cdc@service:[18341]:LOG: connection authorized: user=service_cdc database=service application_name=Debezium General SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off) 2022-08-09 09:15:55 UTC:10.68.174.0(57936):service_cdc@service:[18341]:LOG: disconnection: session time: 0:00:00.097 user=service_cdc database=service host=10.68.174.0 port=57936 2022-08-09 09:15:55 UTC:10.68.174.0(57938):[unknown]@[unknown]:[18346]:LOG: connection received: host=10.68.174.0 port=57938 2022-08-09 09:15:55 UTC:10.68.174.0(57938):service_cdc@service:[18346]:LOG: connection authorized: user=service_cdc database=service application_name=Debezium General SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off) 2022-08-09 09:15:55 UTC:10.68.174.0(57948):[unknown]@[unknown]:[18373]:LOG: connection received: host=10.68.174.0 port=57948 2022-08-09 09:15:55 UTC:10.68.174.0(57948):service_cdc@service:[18373]:DETAIL: Streaming transactions committing after 6E3/6B12B928, reading WAL from 6E3/6A4D11C8. 2022-08-09 09:15:55 UTC:10.68.174.0(57948):service_cdc@service:[18373]:DETAIL: There are no running transactions. 2022-08-09 09:15:55 UTC:10.68.174.0(57948):service_cdc@service:[18373]:LOG: logical decoding found consistent point at 6E3/6A4D11C8 2022-08-09 09:15:55 UTC:10.68.174.0(57948):service_cdc@service:[18373]:LOG: replication connection authorized: user=service_cdc application_name=Debezium Streaming SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off) 2022-08-09 09:15:55 UTC:10.68.174.0(57948):service_cdc@service:[18373]:LOG: starting logical decoding for slot cdc_service_events 2022-08-09 09:15:55 UTC:10.68.174.0(57948):service_cdc@service:[18373]:STATEMENT: START_REPLICATION SLOT cdc_service_events LOGICAL 6E3/6B12EEC0 (proto_version '1', publication_names 'dbz_publication') 2022-08-09 09:15:55 UTC:10.68.174.0(57948):service_cdc@service:[18373]:STATEMENT: START_REPLICATION SLOT cdc_service_events LOGICAL 6E3/6B12EEC0 (proto_version '1', publication_names 'dbz_publication') 2022-08-09 09:15:55 UTC:10.68.174.0(57950):[unknown]@[unknown]:[18379]:LOG: connection received: host=10.68.174.0 port=57950 2022-08-09 09:15:55 UTC:10.68.174.0(57950):service_cdc@service:[18379]:LOG: connection authorized: user=service_cdc database=service application_name=Debezium Slot Info SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off) 2022-08-09 09:15:55 UTC:10.68.174.0(57950):service_cdc@service:[18379]:LOG: disconnection: session time: 0:00:00.014 user=service_cdc database=service host=10.68.174.0 port=57950 2022-08-09 09:15:57 UTC:10.68.174.0(57938):service_cdc@service:[18346]:LOG: disconnection: session time: 0:00:02.617 user=service_cdc database=service host=10.68.174.0 port=57938 2022-08-09 09:15:57 UTC:10.68.174.0(57948):service_cdc@service:[18373]:LOG: disconnection: session time: 0:00:02.272 user=service_cdc database=service host=10.68.174.0 port=57948 2022-08-09 09:15:57 UTC:10.68.174.0(57972):[unknown]@[unknown]:[18438]:LOG: connection received: host=10.68.174.0 port=57972 2022-08-09 09:15:57 UTC:10.68.174.0(57972):service_cdc@service:[18438]:DETAIL: Streaming transactions committing after 6E3/6B12B928, reading WAL from 6E3/6A4D11C8. 2022-08-09 09:15:57 UTC:10.68.174.0(57972):service_cdc@service:[18438]:DETAIL: There are no running transactions. 2022-08-09 09:15:57 UTC:10.68.174.0(57972):service_cdc@service:[18438]:LOG: logical decoding found consistent point at 6E3/6A4D11C8 2022-08-09 09:15:57 UTC:10.68.174.0(57972):service_cdc@service:[18438]:LOG: replication connection authorized: user=service_cdc application_name=Debezium Streaming SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256, compression=off) 2022-08-09 09:15:57 UTC:10.68.174.0(57972):service_cdc@service:[18438]:LOG: starting logical decoding for slot cdc_service_events 2022-08-09 09:15:57 UTC:10.68.174.0(57972):service_cdc@service:[18438]:STATEMENT: START_REPLICATION SLOT cdc_service_events LOGICAL 6E3/6B12EEC0 (proto_version '1', publication_names 'dbz_publication') 2022-08-09 09:15:57 UTC:10.68.174.0(57972):service_cdc@service:[18438]:STATEMENT: START_REPLICATION SLOT cdc_service_events LOGICAL 6E3/6B12EEC0 (proto_version '1', publication_names 'dbz_publication') 2022-08-09 09:15:59 UTC:10.68.174.0(57972):service_cdc@service:[18438]:LOG: disconnection: session time: 0:00:02.085 user=service_cdc database=service host=10.68.174.0 port=57972
Debezium Logs
[2022-08-09 09:15:54,302] INFO Creating task cdc-service-events-initialized-0 (org.apache.kafka.connect.runtime.Worker) [2022-08-09 09:15:54,834] INFO Instantiated task cdc-service-events-initialized-0 with version 1.9.4.Final of type io.debezium.connector.postgresql.PostgresConnectorTask (org.apache.kafka.connect.runtime.Worker) [2022-08-09 09:15:54,897] INFO Starting PostgresConnectorTask with configuration: (io.debezium.connector.common.BaseSourceTask) [2022-08-09 09:15:55,141] INFO Found previous partition offset PostgresPartition [sourcePartition={server=us01.service}]: {transaction_id=null, lsn_proc=7573823745728, lsn_commit=7573823746544, lsn=7573823746544, txId=1990942187, ts_usec=1660036194523499} (io.debezium.connector.common.BaseSourceTask) [2022-08-09 09:15:55,147] INFO user 'service_cdc' connected to database 'service' on PostgreSQL 12.9 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (GCC) 7.4.0, 64-bit with roles: [2022-08-09 09:15:55,152] INFO Obtained valid replication slot ReplicationSlot [active=false, latestFlushedLsn=LSN{6E3/6B12B928}, catalogXmin=1990935203] (io.debezium.connector.postgresql.connection.PostgresConnection) [2022-08-09 09:15:55,153] INFO Found previous offset PostgresOffsetContext [sourceInfoSchema=Schema{io.debezium.connector.postgresql.Source:STRUCT}, sourceInfo=source_info[server='us01.service'db='service', lsn=LSN{6E3/6B12F1F0}, txId=1990942187, lastCommitLsn=LSN{6E3/6B12F1F0}, timestamp=2022-08-09T09:09:54.523499Z, snapshot=FALSE, schema=, table=], lastSnapshotRecord=false, lastCompletelyProcessedLsn=LSN{6E3/6B12EEC0}, lastCommitLsn=LSN{6E3/6B12F1F0}, streamingStoppingLsn=null, transactionContext=TransactionContext [currentTransactionId=null, perTableEventCount={}, totalEventCount=0], incrementalSnapshotContext=IncrementalSnapshotContext [windowOpened=false, chunkEndPosition=null, dataCollectionsToSnapshot=[], lastEventKeySent=null, maximumKey=null]] (io.debezium.connector.postgresql.PostgresConnectorTask) [2022-08-09 09:15:55,153] INFO Snapshots are not allowed as per configuration, starting streaming logical changes only (io.debezium.connector.postgresql.snapshot.NeverSnapshotter) [2022-08-09 09:15:55,171] INFO Requested thread factory for connector PostgresConnector, id = us01.service named = change-event-source-coordinator (io.debezium.util.Threads) [2022-08-09 09:15:55,174] INFO Creating thread debezium-postgresconnector-us01.service-change-event-source-coordinator (io.debezium.util.Threads) [2022-08-09 09:15:55,175] INFO WorkerSourceTask{id=cdc-service-events-initialized-0} Source task finished initialization and start (org.apache.kafka.connect.runtime.WorkerSourceTask) [2022-08-09 09:15:55,175] INFO WorkerSourceTask{id=cdc-service-events-initialized-0} Executing source task (org.apache.kafka.connect.runtime.WorkerSourceTask) [2022-08-09 09:15:55,177] INFO Metrics registered (io.debezium.pipeline.ChangeEventSourceCoordinator) [2022-08-09 09:15:55,179] INFO Context created (io.debezium.pipeline.ChangeEventSourceCoordinator) [2022-08-09 09:15:55,180] INFO According to the connector configuration no snapshot will be executed (io.debezium.connector.postgresql.PostgresSnapshotChangeEventSource) [2022-08-09 09:15:55,181] INFO Snapshot ended with SnapshotResult [status=SKIPPED, offset=PostgresOffsetContext [sourceInfoSchema=Schema{io.debezium.connector.postgresql.Source:STRUCT}, sourceInfo=source_info[server='us01.service'db='service', lsn=LSN{6E3/6B12F1F0}, txId=1990942187, lastCommitLsn=LSN{6E3/6B12F1F0}, timestamp=2022-08-09T09:09:54.523499Z, snapshot=FALSE, schema=, table=], lastSnapshotRecord=false, lastCompletelyProcessedLsn=LSN{6E3/6B12EEC0}, lastCommitLsn=LSN{6E3/6B12F1F0}, streamingStoppingLsn=null, transactionContext=TransactionContext [currentTransactionId=null, perTableEventCount={}, totalEventCount=0], incrementalSnapshotContext=IncrementalSnapshotContext [windowOpened=false, chunkEndPosition=null, dataCollectionsToSnapshot=[], lastEventKeySent=null, maximumKey=null]]] (io.debezium.pipeline.ChangeEventSourceCoordinator) [2022-08-09 09:15:55,389] INFO Starting streaming (io.debezium.pipeline.ChangeEventSourceCoordinator) [2022-08-09 09:15:55,389] INFO Retrieved latest position from stored offset 'LSN{6E3/6B12EEC0}' (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource) [2022-08-09 09:15:55,389] INFO Looking for WAL restart position for last commit LSN 'LSN{6E3/6B12F1F0}' and last change LSN 'LSN{6E3/6B12EEC0}' (io.debezium.connector.postgresql.connection.WalPositionLocator) [2022-08-09 09:15:55,389] INFO Initializing PgOutput logical decoder publication (io.debezium.connector.postgresql.connection.PostgresReplicationConnection) [2022-08-09 09:15:55,426] INFO Obtained valid replication slot ReplicationSlot [active=false, latestFlushedLsn=LSN{6E3/6B12B928}, catalogXmin=1990935203] (io.debezium.connector.postgresql.connection.PostgresConnection) [2022-08-09 09:15:55,427] INFO Connection gracefully closed (io.debezium.jdbc.JdbcConnection) [2022-08-09 09:15:55,445] INFO Requested thread factory for connector PostgresConnector, id = us01.service named = keep-alive (io.debezium.util.Threads) [2022-08-09 09:15:55,446] INFO Creating thread debezium-postgresconnector-us01.service-keep-alive (io.debezium.util.Threads) [2022-08-09 09:15:55,612] INFO Searching for WAL resume position (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource) [2022-08-09 09:15:57,637] INFO First LSN 'LSN{6E3/6B12EB40}' received (io.debezium.connector.postgresql.connection.WalPositionLocator) [2022-08-09 09:15:57,662] INFO LSN after last stored change LSN 'LSN{6E3/6B12F040}' received (io.debezium.connector.postgresql.connection.WalPositionLocator) [2022-08-09 09:15:57,662] INFO WAL resume position 'LSN{6E3/6B12F040}' discovered (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource) [2022-08-09 09:15:57,664] INFO Connection gracefully closed (io.debezium.jdbc.JdbcConnection) [2022-08-09 09:15:57,666] INFO Connection gracefully closed (io.debezium.jdbc.JdbcConnection) [2022-08-09 09:15:57,683] INFO Initializing PgOutput logical decoder publication (io.debezium.connector.postgresql.connection.PostgresReplicationConnection) [2022-08-09 09:15:57,697] INFO Requested thread factory for connector PostgresConnector, id = us01.service named = keep-alive (io.debezium.util.Threads) [2022-08-09 09:15:57,698] INFO Creating thread debezium-postgresconnector-us01.service-keep-alive (io.debezium.util.Threads) [2022-08-09 09:15:57,698] INFO Processing messages (io.debezium.connector.postgresql.PostgresStreamingChangeEventSource) [2022-08-09 09:15:59,737] INFO Streaming requested from LSN LSN{6E3/6B12EEC0}, received LSN LSN{6E3/6B12EB40} identified as already processed (io.debezium.connector.postgresql.connection.AbstractMessageDecoder) [2022-08-09 09:15:59,738] INFO Streaming requested from LSN LSN{6E3/6B12EEC0}, received LSN LSN{0/0} identified as already processed (io.debezium.connector.postgresql.connection.AbstractMessageDecoder) [2022-08-09 09:15:59,750] INFO Streaming requested from LSN LSN{6E3/6B12EEC0}, received LSN LSN{6E3/6B12EB40} identified as already processed (io.debezium.connector.postgresql.connection.AbstractMessageDecoder) at java.base/java.lang.Thread.run(Thread.java:829) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:109) at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:141) at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:174) at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.execute(PostgresStreamingChangeEventSource.java:41) at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.execute(PostgresStreamingChangeEventSource.java:177) at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.processMessages(PostgresStreamingChangeEventSource.java:215) at io.debezium.connector.postgresql.connection.PostgresReplicationConnection$1.readPending(PostgresReplicationConnection.java:498) at io.debezium.connector.postgresql.connection.pgoutput.PgOutputMessageDecoder.shouldMessageBeSkipped(PgOutputMessageDecoder.java:133) at io.debezium.connector.postgresql.connection.AbstractMessageDecoder.shouldMessageBeSkipped(AbstractMessageDecoder.java:44) at io.debezium.connector.postgresql.connection.WalPositionLocator.skipMessage(WalPositionLocator.java:141) io.debezium.DebeziumException: Message with LSN 'LSN{6E3/6B12F1F0}' larger than expected LSN 'LSN{6E3/6B12F040}'. This is unexpected and can lead to an infinite loop or a data loss. [2022-08-09 09:15:59,750] ERROR Producer failure (io.debezium.pipeline.ErrorHandler) [2022-08-09 09:15:59,755] INFO Connection gracefully closed (io.debezium.jdbc.JdbcConnection) [2022-08-09 09:15:59,760] INFO Connection gracefully closed (io.debezium.jdbc.JdbcConnection) [2022-08-09 09:15:59,760] INFO Finished streaming (io.debezium.pipeline.ChangeEventSourceCoordinator) [2022-08-09 09:15:59,760] INFO Connected metrics set to 'false' (io.debezium.pipeline.ChangeEventSourceCoordinator) [2022-08-09 09:16:00,185] INFO WorkerSourceTask{id=cdc-service-events-initialized-0} flushing 0 outstanding messages for offset commit (org.apache.kafka.connect.runtime.WorkerSourceTask) ... 9 more at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.execute(PostgresStreamingChangeEventSource.java:177) at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.processMessages(PostgresStreamingChangeEventSource.java:215) at io.debezium.connector.postgresql.connection.PostgresReplicationConnection$1.readPending(PostgresReplicationConnection.java:498) at io.debezium.connector.postgresql.connection.pgoutput.PgOutputMessageDecoder.shouldMessageBeSkipped(PgOutputMessageDecoder.java:133) at io.debezium.connector.postgresql.connection.AbstractMessageDecoder.shouldMessageBeSkipped(AbstractMessageDecoder.java:44) at io.debezium.connector.postgresql.connection.WalPositionLocator.skipMessage(WalPositionLocator.java:141) Caused by: io.debezium.DebeziumException: Message with LSN 'LSN{6E3/6B12F1F0}' larger than expected LSN 'LSN{6E3/6B12F040}'. This is unexpected and can lead to an infinite loop or a data loss. at java.base/java.lang.Thread.run(Thread.java:829) at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:109) at io.debezium.pipeline.ChangeEventSourceCoordinator.executeChangeEventSources(ChangeEventSourceCoordinator.java:141) at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:174) at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.execute(PostgresStreamingChangeEventSource.java:41) at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.execute(PostgresStreamingChangeEventSource.java:180) at io.debezium.pipeline.ErrorHandler.setProducerThrowable(ErrorHandler.java:50) org.apache.kafka.connect.errors.ConnectException: An exception occurred in the change event producer. This connector will be stopped. [2022-08-09 09:16:00,185] ERROR WorkerSourceTask{id=cdc-service-events-initialized-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask) [2022-08-09 09:16:00,186] INFO Stopping down connector (io.debezium.connector.common.BaseSourceTask) [2022-08-09 09:16:00,186] INFO [Producer clientId=connector-producer-cdc-service-events-initialized-0] Closing the Kafka producer with timeoutMillis = 30000 ms. (org.apache.kafka.clients.producer.KafkaProducer)
How to reproduce the issue using our tutorial deployment?
We are unable to reproduce this issue on demand.