Uploaded image for project: 'Debezium'
  1. Debezium
  2. DBZ-5163

Debezium Postgres Connector goes into unrecoverable state after reconnecting

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Obsolete
    • Icon: Major Major
    • None
    • 1.9.2.Final
    • postgresql-connector
    • None
    • False
    • None
    • False
    • Hide

      We do not have a way to reliably reproduce it.

      Show
      We do not have a way to reliably reproduce it.

      In order to make your issue reports as actionable as possible, please provide the following information, depending on the issue type.

      Bug report

      For bug reports, provide this information, please:

      What Debezium connector do you use and what version?

      PostgreSQL connector version 1.9.2.Final

      What is the connector configuration?

      I have removed database.hostname, database.port, database.user, database.password, database.dbname, database.server.name, and heartbeat.action.query. All other settings are identical.  

       

      {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "tasks.max": "1",
        "decimal.handling.mode": "string",
        "plugin.name": "pgoutput",
        "slot.name" : "debezium",
        "connector.client.config.override.policy": "All",
        "event.processing.failure.handling.mode": "skip",
        "producer.max.request.size": "8388608",
        "max.request.size": "8388608",
        "producer.override.max.request.size" : "8388608", 
        "heartbeat.interval.ms": 60000,
        "snapshot.mode": "never",
        "topic.creation.default.replication.factor": 3,
        "topic.creation.default.partitions": 3,
        "topic.creation.default.compression.type": "snappy",
        "topic.creation.default.max.message.bytes": 8388608
      }
      

       

      What is the captured database version and mode of depoyment?

      (E.g. on-premises, with a specific cloud provider, etc.)

      AWS RDS PostgreSQL v11.7

      What behaviour do you expect?

      After disconnecting, Debezium to reconnect and pick up from the place it left off in the transaction logs.

      What behaviour do you see?

      After leaving Debezium on for a long enough timescale, eventually it will disconnect from Postgres, resulting in the following error and stack trace:

      2022-05-18T17:22:30.335-05:00 ERROR Postgres|redacted|streaming Producer failure [io.debezium.pipeline.ErrorHandler] 
      2022-05-18T17:22:30.335-05:00 org.postgresql.util.PSQLException: Database connection failed when writing to copy 
      2022-05-18T17:22:30.335-05:00 at org.postgresql.core.v3.QueryExecutorImpl.flushCopy(QueryExecutorImpl.java:1143) 
      2022-05-18T17:22:30.335-05:00 at org.postgresql.core.v3.CopyDualImpl.flushCopy(CopyDualImpl.java:30) 
      ...
      2022-05-18T17:22:30.335-05:00 Caused by: java.net.SocketException: Connection or outbound has closed 222022-05-18T17:22:30.335-05:00 at java.base/sun.security.ssl.SSLSocketImpl$AppOutputStream.write(SSLSocketImpl.java:1297) 
      2022-05-18T17:22:30.335-05:00 at java.base/java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:81)...

      Debezium, however, is capable of reconnecting these disconnections and, as expected, it reconnects to the database, as evidenced by the logs:

      2022-05-18T17:22:30.615-05:00    INFO || Awaiting end of restart backoff period after a retriable error [io.debezium.connector.common.BaseSourceTask]
      ...
      2022-05-18T17:22:40.619-05:00    INFO || Starting PostgresConnectorTask with configuration: [io.debezium.connector.common.BaseSourceTask]
      ...
      2022-05-18T17:22:40.996-05:00   INFO || Found previous partition offset PostgresPartition [sourcePartition={serverredacted}]: {transaction_id=null, lsn_proc=81552370930144, lsn_commit=81552370930144, lsn=81552370930144, txId=343988299, ts_usec=1652912499319760}   [io.debezium.connector.common.BaseSourceTask]
      ...
      2022-05-18T17:22:41.002-05:00   INFO Postgres|redacted|postgres-connector-task  user 'debezium' connected to database 'redacted' on PostgreSQL 11.7 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.3 20140911 (Red Hat 4.8.3-9), 64-bit with roles:
      ...
      2022-05-18T17:22:41.006-05:00   INFO Postgres|redacted|postgres-connector-task Obtained valid replication slot ReplicationSlot [active=false, latestFlushedLsn=LSN{4A2B/E4174768}, catalogXmin=343987943] [io.debezium.connector.postgresql.connection.PostgresConnection]
      ...
      2022-05-18T17:22:41.006-05:00   INFO Postgres|redacted|postgres-connector-task  Found previous offset PostgresOffsetContext [sourceInfoSchema=Schema{io.debezium.connector.postgresql.Source:STRUCT}, sourceInfo=source_info[server='redacted'db='redacted', lsn=LSN{4A2B/E41991E0}, txId=343988299, lastCommitLsn=LSN{4A2B/E41991E0}, timestamp=2022-05-18T22:21:39.319760Z, snapshot=FALSE, schema=, table=], lastSnapshotRecord=false, lastCompletelyProcessedLsn=LSN{4A2B/E41991E0}, lastCommitLsn=LSN{4A2B/E41991E0}, streamingStoppingLsn=null, transactionContext=TransactionContext [currentTransactionId=null, perTableEventCount={}, totalEventCount=0], incrementalSnapshotContext=IncrementalSnapshotContext [windowOpened=false, chunkEndPosition=null, dataCollectionsToSnapshot=[], lastEventKeySent=null, maximumKey=null]]   [io.debezium.connector.postgresql.PostgresConnectorTask]

      However, once it attempts to find the last LSN, we see that the LSN reported by Debezium and the LSN reported by postgres are different, over 100,000 apart.
      Debezium:

      2022-05-18 22:22:43,028 INFO   Postgres|redacted|streaming Looking for WAL restart position for last commit LSN 'LSN{4A2B/E41991E0}' and last change LSN 'LSN{4A2B/E41991E0}'   [io.debezium.connector.postgresql.connection.WalPositionLocator] 

      Postgres:

      2022-05-18 22:22:43 UTC debezium@redacted:[5540]:DETAIL:  Streaming transactions committing after 4A2B/E4174768, reading WAL from 4A2B/DC14B7A8. 
      2022-05-18 22:22:43 UTC debezium@redacted:[5540]:LOG:  logical decoding found consistent point at 4A2B/DC14B7A8
      2022-05-18 22:22:43 UTC debezium@redacted:[5540]:DETAIL:  Logical decoding will begin using saved snapshot. 

      In addition, when Debezium searches for the LSN, we see it search for a long time, over 40 minutes, before we shut it down. The LSNs it compares against are all invariably larger than the LSN Debezium has as the last LSN, but it marks them all as skippable.

      2022-05-18T17:22:45.452-05:00 INFO   Postgres|redacted|streaming  Streaming requested from LSN LSN{4A2B/E41991E0}, received LSN LSN{4A2B/E41D2550} identified as already processed   [io.debezium.connector.postgresql.connection.AbstractMessageDecoder]

      There are well over 10,000 of the previous message, with the "received" LSNs varying, but all of them are greater than the "requested" LSN and there are a large number of duplicates.
      Despite this, the heartbeat records do continue to be written, meaning the connection is definitely real, but we fail to get any CDC information whatsoever from postgres, with Debezium only continually checking and skipping every message.

      It's unclear why the LSN's don't match between Debezium and Postgres, or what it's really doing for 40 minutes, but with no CDC whatsoever, its clear that this is a large problem, and we have had it happen on multiple occaisions. We did see comments in the source code for io.debezium.connector.postgresql.connection.WalPositionLocator that would indicate that the describe needing to "find" the equivalent LSN as a starting point, as well as it being necessarily unordered, but is it really legitimate that after 40 minutes it would still not have found that value?

      The nearest similar issue is DBZ-4865, but it's hard to tell if this is actually the same issue or a distinct issue and solution.

      Edited 26-5-2022:
      After another overnight run, this issue has happened again. The reported error by Postgres is the same, and as with before, the disconnected time was very short and it had reconnected well within the same minute. The LSNs from this run, as reported in Debezium's logs, are:

      Looking for WAL restart position for last commit LSN 'LSN{4A8B/FFACFA8}' and last change LSN 'LSN{4A8B/FFACFA8}'   [io.debezium.connector.postgresql.connection.WalPositionLocator]
      
      First LSN 'LSN{4A8B/178BBF38}' received [io.debezium.connector.postgresql.connection.WalPositionLocator]
      
      Received COMMIT LSN 'LSN{4A8B/178BDFA0}' larger than than last stored commit LSN 'LSN{4A8B/FFACFA8}'   [io.debezium.connector.postgresql.connection.WalPositionLocator]
      
      Will restart from LSN 'LSN{4A8B/178BBF38}' that is start of the first unprocessed transaction   [io.debezium.connector.postgresql.connection.WalPositionLocator]
      
      WAL resume position 'LSN{4A8B/178BBF38}' discovered   [io.debezium.connector.postgresql.PostgresStreamingChangeEventSource]
      
      Streaming requested from LSN LSN{4A8B/FFACFA8}, received LSN LSN{4A8B/1F4E7408} identified as already processed   [io.debezium.connector.postgresql.connection.AbstractMessageDecoder] 

      before continuing to spiral, as before. The corresponding LSNs according to Postgres's logs are as follows:

      DETAIL:  Streaming transactions committing after 4A8B/178BDFA0, reading WAL from 4A8A/681032D0.
      
      LOG:  logical decoding found consistent point at 4A8A/681032D0 

      As with the previous example, the LSN's don't seem to match up, with only the 4A8B/178BDFA0 shared between them. When calculated in hexadecimal, the difference between these LSN's seems massive.

      4A8B/178BDFA0 - 4A8A/681032D0 = 2,944,117,968
      4A8B/178BDFA0 - 4A8B/FFACFA8 = 126,947,320

      These calculations were done with SELECT statements directly in Postgres.

      In addition, we have found a couple of other open tickets recording a similar issue: 

      1. DBZ-3471
      2. DBZ-5031

      Do you see the same behaviour using the latest released Debezium version?

      (Ideally, also verify with latest Alpha/Beta/CR version)

      Yes.

      Do you have the connector logs, ideally from start till finish?

      (You might be asked later to provide DEBUG/TRACE level log)

      Yes, in AWS CloudWatch.

      How to reproduce the issue using our tutorial deployment?

      We do not know what it causing the disconnects, nor how to reliably replicate it.

              vjuranek@redhat.com Vojtech Juranek
              kurt.hardee Kurt Hardee (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

                Created:
                Updated:
                Resolved: