Uploaded image for project: 'Debezium'
  1. Debezium
  2. DBZ-2310

Skipping LSN is inefficient and does not forward slot position

    XMLWordPrintable

Details

    • Bug
    • Resolution: Done
    • Major
    • 1.3.0.Alpha1
    • 1.1.2.Final, 1.2.0.Final
    • postgresql-connector
    • None
      1. Run postgresql v1.1.2 connector
      2. Have some transactions which produce a lot of WAL logs
      3. Have a lot of small transactions at the same time
      4. Kill the connector while its processing such a large transaction

    Description

      Hello everyone,

      thanks for all your work in debezium. It is truly an amazing tool. Thank you!

      We are running debezium postgresql connector 1.1.2 against an RDS database. The other day we restarted kafka-connect and something strange happened. After the connector was restarted (after a unclean shutdown), it started to log these messages:

      Streaming requested from LSN 88198654860672 but received LSN 88198094438920 that is same or smaller so skipping the message
      

      I am unsure why this happened, but the effects are bad:

      • The replication slot is not advanced, therefore the postgres server keeps its WAL files (so a restart while catching up erases the progress)
      • The connector is not producing new messages, therefore no changes arrive at downstream systems
      • Catching up is extremely slow, therefore sometimes the gap increased

      While doing this the connector was not producing production relevant data, therefore we were replacing it and where creating a new snapshot
       
      After reading through the source code I think I was able to pinpoint the problem to this method:
       

                  @Override
                  public boolean readPending(ReplicationMessageProcessor processor) throws SQLException, InterruptedException {
                      ByteBuffer read = stream.readPending();
                      final long lastReceiveLsn = stream.getLastReceiveLSN().asLong();
                      LOGGER.trace("Streaming requested from LSN {}, received LSN {}", startingLsn, lastReceiveLsn);
      
                      if (read == null || messageDecoder.shouldMessageBeSkipped(read, lastReceiveLsn, startingLsn, skipFirstFlushRecord)) {
                          return false;
                      }
      
                      deserializeMessages(read, processor);
      
                      return true;
                  }
      

      https://github.com/debezium/debezium/blob/ee91eac4183b15ec6db449f479d5de27e1c662be/debezium-connector-postgres/src/main/java/io/debezium/connector/postgresql/connection/PostgresReplicationConnection.java#L400-L413

      If I reading this correctly, whenever messages are skipped this is considering as a "no pending messages" and therefore is not progressing the replication slot and actually waiting the poll interval for new messages.

      I think the method should look like this:

                  @Override
                  public boolean readPending(ReplicationMessageProcessor processor) throws SQLException, InterruptedException {
                      ByteBuffer read = stream.readPending();
                      final long lastReceiveLsn = stream.getLastReceiveLSN().asLong();
                      LOGGER.trace("Streaming requested from LSN {}, received LSN {}", startingLsn, lastReceiveLsn);
      
                      if (read == null) {
                          return false;
                      }
      
                      if (messageDecoder.shouldMessageBeSkipped(read, lastReceiveLsn, startingLsn, skipFirstFlushRecord)) {
                          return true;
                      }
      
                      deserializeMessages(read, processor);
      
                      return true;
                  }
      

      (Patch is attached)

      Thank you for considering and reading this.

      Attachments

        Activity

          People

            Unassigned Unassigned
            bjoernhaeuser Björn Häuser (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: