Uploaded image for project: 'Debezium'
  1. Debezium
  2. DBZ-2989

Postgres connector cannot reuse previously created replication slot

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Major Major
    • None
    • None
    • None
    • None
    • False
    • False
    • 1.4
    • Undefined
    • Hide

      Restart "RUNNING" or "FAILED" task when replication slot in Postgres exists and free.
      It 100% reproduced for 1 of 3 my DBs.

      Show
      Restart "RUNNING" or "FAILED" task when replication slot in Postgres exists and free. It 100% reproduced for 1 of 3 my DBs.

      Connector can't reconnect to Postgres to its replication slot, while its free.
      Error:

      2021-01-19 21:48:43,824 INFO   Postgres|ccp_ocm.dev|postgres-connector-task  Searching for WAL resume position   [io.debezium.connector.postgresql.PostgresStreamingChangeEventSource]
      2021-01-19 21:48:48,883 INFO   Postgres|ccp_ocm.dev|postgres-connector-task  First LSN 'LSN{E0/EE0B8CC8}' received   [io.debezium.connector.postgresql.connection.WalPositionLocator]
      2021-01-19 21:48:48,883 INFO   Postgres|ccp_ocm.dev|postgres-connector-task  WAL resume position 'LSN{E0/EE0B8CC8}' discovered   [io.debezium.connector.postgresql.PostgresStreamingChangeEventSource]
      2021-01-19 21:48:48,892 INFO   Postgres|ccp_ocm.dev|postgres-connector-task  Initializing PgOutput logical decoder publication   [io.debezium.connector.postgresql.connection.PostgresReplicationConnection]
      2021-01-19 21:48:48,895 ERROR  Postgres|ccp_ocm.dev|postgres-connector-task  Producer failure   [io.debezium.pipeline.ErrorHandler]
      io.debezium.DebeziumException: Failed to start replication stream at LSN{E0/EE0B8C90}; when setting up multiple connectors for the same database host, please make sure to use a distinct replication slot name for each.
          at io.debezium.connector.postgresql.connection.PostgresReplicationConnection.startStreaming(PostgresReplicationConnection.java:308)
          at io.debezium.connector.postgresql.PostgresStreamingChangeEventSource.execute(PostgresStreamingChangeEventSource.java:143)
          at io.debezium.pipeline.ChangeEventSourceCoordinator.streamEvents(ChangeEventSourceCoordinator.java:140)
          at io.debezium.pipeline.ChangeEventSourceCoordinator.lambda$start$0(ChangeEventSourceCoordinator.java:113)
          at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
          at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
          at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
          at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
          at java.base/java.lang.Thread.run(Thread.java:834)
      Caused by: org.postgresql.util.PSQLException: ERROR: replication slot "cdc_ocm_slot" is active for PID 6899
          at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2532)
          at org.postgresql.core.v3.QueryExecutorImpl.processCopyResults(QueryExecutorImpl.java:1200)
          at org.postgresql.core.v3.QueryExecutorImpl.startCopy(QueryExecutorImpl.java:882)
          at org.postgresql.core.v3.replication.V3ReplicationProtocol.initializeReplication(V3ReplicationProtocol.java:58)
          at org.postgresql.core.v3.replication.V3ReplicationProtocol.startLogical(V3ReplicationProtocol.java:42)
          at org.postgresql.replication.fluent.ReplicationStreamBuilder$1.start(ReplicationStreamBuilder.java:38)
          at org.postgresql.replication.fluent.logical.LogicalStreamBuilder.start(LogicalStreamBuilder.java:37)
          at io.debezium.connector.postgresql.connection.PostgresReplicationConnection.startPgReplicationStream(PostgresReplicationConnection.java:569)
          at io.debezium.connector.postgresql.connection.PostgresReplicationConnection.createReplicationStream(PostgresReplicationConnection.java:406)
          at io.debezium.connector.postgresql.connection.PostgresReplicationConnection.startStreaming(PostgresReplicationConnection.java:301)
          ... 8 more
      

      When i try restart task, it fails with same error. Only PID changes.
      It happens allways after connection lost to Postgres or after restart/recreate debezium container with the same connector or restart task.
      In postgres there is no connected clients to this slot:

      postgres=# SELECT * FROM pg_replication_slots where slot_name='cdc_ocm_slot';
      
      slot_name   |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
      --------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
       cdc_ocm_slot | pgoutput | logical   | 415842 | ccp_ocm  | f         | f      |            |      |     94650945 | DF/1921AAE8 | DF/19220DC8
      (1 row)

       
      I found only one way to fix it: delete replaction slot in postgres manually with select pg_drop_replication_slot('cdc_ocm_slot') and then restart task.

      I can start streaming from this slot manually with and see WAL records in console:

      pg_recvlogical --slot=cdc_ocm_slot -v  -d ccp_ocm --host db-hostname --username user -o proto_version=1 -o publication_names='dbz_publication' --file -  --start
      

      I noticed, that there are 2 PID apears in replication slot after restart consistently: first for a small amount time, and second during all work. And when task failed, it says that slot is busy with first PID.

      One restart (when slot is free) then 2 PIDs:

      postgres=# SELECT * FROM pg_replication_slots where slot_name='cdc_ocm_slot';
        slot_name   |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
      --------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
       cdc_ocm_slot | pgoutput | logical   | 415842 | ccp_ocm  | f         | f      |            |      |     96647723 | E0/EEB268C8 | E0/EEB26900
      (1 row)
      
      postgres=# SELECT * FROM pg_replication_slots where slot_name='cdc_ocm_slot';
        slot_name   |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
      --------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
       cdc_ocm_slot | pgoutput | logical   | 415842 | ccp_ocm  | f         | t      |       7330 |      |     96647723 | E0/EEB268C8 | E0/EEB26900
      (1 row)
      
      postgres=# SELECT * FROM pg_replication_slots where slot_name='cdc_ocm_slot';
        slot_name   |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
      --------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
       cdc_ocm_slot | pgoutput | logical   | 415842 | ccp_ocm  | f         | t      |       7332 |      |     96647723 | E0/EEB268C8 | E0/EEB26900
      (1 row)
      
      postgres=# SELECT * FROM pg_replication_slots where slot_name='cdc_ocm_slot';
        slot_name   |  plugin  | slot_type | datoid | database | temporary | active | active_pid | xmin | catalog_xmin | restart_lsn | confirmed_flush_lsn
      --------------+----------+-----------+--------+----------+-----------+--------+------------+------+--------------+-------------+---------------------
       cdc_ocm_slot | pgoutput | logical   | 415842 | ccp_ocm  | f         | t      |       7332 |      |     96647723 | E0/EEB268C8 | E0/EEB26900
      (1 row)
      

            Unassigned Unassigned
            alecx.sitex Aleksey Dergachev (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved: