After deploying the Debezium connector for Postgres in our unstable environment, our Postgres instance quickly began running out of disk space.

We have a single Postgres instance hosting two databases: production and unstable.

The connector is configured to capture changes from the unstable database. The changes on unstable are few and the connector works correctly.
However, traffic on the production database is quite high, and we noticed that disk usage increased monotonically until we almost ran out of disk space.

After stopping the connector and deleting the replication slot, the disk space was freed up.
It appears to me that replication slots in Postgres work across all databases in the instance, while the connector is not listening to the events of the other databases.
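One way to confirm this is to check how much WAL the slot is holding back by comparing its restart_lsn with the current WAL position. A minimal check, assuming PostgreSQL 10 or newer (on 9.x the corresponding functions are pg_current_xlog_location and pg_xlog_location_diff):

    -- Show how much WAL each replication slot is retaining (PostgreSQL 10+).
    SELECT slot_name,
           database,
           active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots;

On RDS this is the same lag that is surfaced as the 'Oldest replication slot lag' metric mentioned in the comments below.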

Attachments:

1. debezium.png (42 kB, Bogdan Ionescu)
2. free.jpg (65 kB, Bogdan Ionescu)
3. heartbeat.jpg (50 kB, Bogdan Ionescu)
4. leak.JPG (60 kB, Bogdan Ionescu)
5. Screen Shot 2018-09-03 at 09.58.53.png (130 kB, Andrii Bratus)
6. Screenshot 2018-12-11 at 10.44.35.png (41 kB, Andrii Bratus)
7. Screenshot 2018-12-11 at 10.44.54.png (40 kB, Andrii Bratus)
8. Transaction Logs Usage.png (190 kB, Hayman Yu)

            [DBZ-892] Connector Postgres runs out of disk space

Jiri Pechanec added a comment - Released

Tapani Moilanen added a comment - edited

The doc update is great! I am still of the opinion that a fix for this problem should be included in Debezium (dummy updates through the JDBC side-channel), although behind an option so as to not create transactions in the db by default. The problem is elusive to a standard user, not the easiest to understand completely, and it requires a continuously running additional component maintained separately. In a simple environment it's not a big deal, but with restricted environments that sucks. I fail to see the difference compared to heartbeat transactions; it's a very similar problem, but with the fix inserting into Postgres instead of Kafka. IIUC from previous comments jpechane disagrees though.

Gunnar Morling added a comment

Ok, so this explains it then: the replication slot is specific to one DB, but the changes only happen in another one. So there's no way the slot can progress while the WAL grows. Quite a tricky limitation of logical decoding, really. Thanks for the docs update, this is very helpful.
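This binding of a logical slot to a single database can be reproduced without Debezium. A minimal sketch, assuming the built-in test_decoding plugin and a user with replication privileges (the slot name dbz892_demo is arbitrary); run it in the database the connector captures, e.g. unstable:

    -- Create a logical slot in this database (e.g. unstable).
    SELECT pg_create_logical_replication_slot('dbz892_demo', 'test_decoding');

    -- Generate writes only in the other database (e.g. production), then, back here:
    SELECT pg_current_wal_lsn();   -- advances with every write, cluster-wide (PostgreSQL 10+)
    SELECT * FROM pg_logical_slot_peek_changes('dbz892_demo', NULL, NULL);  -- stays empty

    -- Clean up, otherwise this slot will also pin WAL.
    SELECT pg_drop_replication_slot('dbz892_demo');

The slot never sees a decodable change, so it cannot advance, while the WAL it has to retain keeps growing with the other database's traffic.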

Jiri Pechanec added a comment

gunnar.morling Please check the PR; I hope it explains the situation comprehensively.

Gunnar Morling added a comment

But if there's no traffic, why is the WAL growing then? Is it rather that something is happening in the database but this isn't propagated via logical decoding? What's in these WAL entries? Is it something specific to RDS possibly?

Jiri Pechanec added a comment

gunnar.morling Please see attached docs PR.

Jiri Pechanec added a comment

The problem is that a no-traffic scenario does not seem to be that uncommon when Debezium is deployed.

Gunnar Morling added a comment - edited

            jpechane, why is it that the "dummy TX" trick is needed at all? As long as there's any traffic in the DB, we should get changes and be able to send heartbeats. What case am I missing?

            Also, I think this definitely should be described in the FAQs.


Jiri Pechanec added a comment - edited

bogdan.ionescu You are right, the txid_current(); trick really does not work. What works is:

• for short-lived connections: create temporary table xx(id int);
• for long-lived connections: create temporary table xx(id int); and periodically truncate table xx;

It means that no data are written to the database at all.

In both cases only the wal2json plugin emits an empty message, which is used to generate the heartbeat message and flush the recent LSN. This was missing in the code and is part of the associated PR.

            For other decoders the best approach is to follow https://issues.jboss.org/browse/DBZ-892?focusedCommentId=13696265&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13696265

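Put together, the wal2json workaround described in the comment above looks roughly like the following. This is only a sketch of the two variants mentioned there (the table name xx is taken from the comment and is arbitrary), run from a side connection against the database the connector captures:

    -- Variant 1: short-lived connections.
    -- Each run creates a temporary table; it is dropped automatically when the
    -- session ends, and wal2json emits an empty message for the transaction.
    create temporary table xx(id int);

    -- Variant 2: a long-lived connection.
    -- Create the temporary table once, then truncate it periodically (e.g. from a
    -- scheduler) so the connector keeps receiving heartbeats and can flush the LSN.
    create temporary table xx(id int);
    truncate table xx;

Either way no rows are ever written, which matches the point above that no data are written to the database at all.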

Bogdan Ionescu added a comment - edited

As others mentioned, it does not work:

    BEGIN;
    SELECT txid_current();
    COMMIT;

sadly achieves nothing. It would appear the RDS 'Oldest replication slot lag' increases by roughly 190 MB each hour, and it has been fairly constant for nearly 3 days. I've stopped all activity on the database, there are no external hits whatsoever, and the increase rate is constant.
Inserting or updating records in the monitored db will 'fix' it.
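For decoders where the empty-message heartbeat is not available, the observation above that inserting or updating records in the monitored db fixes it can be turned into a scheduled dummy update. A minimal sketch, assuming a dedicated table (the name debezium_heartbeat is hypothetical) in the database the connector captures, updated periodically by an external scheduler such as cron:

    -- One-time setup in the monitored database (hypothetical table name).
    CREATE TABLE debezium_heartbeat (id int PRIMARY KEY, ts timestamptz);
    INSERT INTO debezium_heartbeat VALUES (1, now());

    -- Run periodically so real WAL traffic reaches the slot and the connector
    -- can acknowledge a recent LSN, letting Postgres recycle older WAL segments.
    UPDATE debezium_heartbeat SET ts = now() WHERE id = 1;

The update interval can be chosen based on how quickly WAL accumulates; in the report above the slot lag grew by roughly 190 MB per hour.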

Assignee: Jiri Pechanec (jpechane)
Reporter: Andrii Bratus (andreybratus)
Votes: 7
Watchers: 16
