  Debezium / DBZ-2617

Patroni can't stop PostgreSQL when Debezium is streaming


    Details

    • Steps to Reproduce:
      1. Declare the Debezium connector to Kafka Connect
      2. Wait for the snapshot to end
      3. Use pgbench to write into the database and trigger the heartbeats
      4. Write to other databases
      5. Wait at least the heartbeat interval to ensure the slot LSN has moved forward
      6. Perform a failover with patronictl or shut down the instance manually
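
      For reference, the connector from step 1 can be registered with a payload along these lines (a sketch; the hostname, credentials, and connector name are placeholders, not taken from this report — only the heartbeat settings and the debezium_test slot name match what is shown below):

```json
{
  "name": "debezium-test",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "wal2json",
    "database.hostname": "db.example.com",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "test",
    "database.server.name": "cluster99",
    "slot.name": "debezium_test",
    "heartbeat.interval.ms": "5000",
    "heartbeat.action.query": "update debezium set ts=now();"
  }
}
```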

      Description

      Patroni is a high-availability operator that controls the PostgreSQL process lifecycle (initdb, start, stop, promote) and manages replication. It continuously tries to hold a lock in a distributed configuration store (DCS); when it fails to do so, another node is promoted by acquiring the same lock elsewhere.

      When Patroni is used to manage PostgreSQL failovers and Debezium is connected, a forced switchover gets stuck in the shutting-down state.

      patroni-<redacted> ~ # patronictl switchover --force
      Current cluster topology
      + Cluster: cluster99 (6571679703481550190) ----------------+--------+---------+-----+-----------+---------------------+
      |                 Member                 |       Host      |  Role  |  State  |  TL | Lag in MB | Tags                |
      +----------------------------------------+-----------------+--------+---------+-----+-----------+---------------------+
      |           patroni-<redacted>           |    <redacted>   |        | running | 137 |         0 | clonefrom: true     |
      |                                        |                 |        |         |     |           | nofailover: true    |
      |                                        |                 |        |         |     |           | noloadbalance: true |
      +----------------------------------------+-----------------+--------+---------+-----+-----------+---------------------+
      |           patroni-<redacted>           |    <redacted>   |        | running | 137 |         0 |                     |
      +----------------------------------------+-----------------+--------+---------+-----+-----------+---------------------+
      |           patroni-<redacted>           |    <redacted>   | Leader | running | 137 |           |                     |
      +----------------------------------------+-----------------+--------+---------+-----+-----------+---------------------+
      2020-09-25 18:19:55,586 - WARNING - /usr/lib/python3/dist-packages/urllib3/connectionpool.py:845: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
        InsecureRequestWarning)
      
      Switchover failed, details: 503, Switchover status unknown
      

      No new connections are allowed. The following processes are still alive:

      postgres 30055  0.0  0.8 54220116 1145376 ?    S    17:07   0:01 /usr/lib/postgresql/9.6/bin/postgres -D /data/postgresql/main --config-file=/etc/postgresql/9.6/main/postgresql.conf --por
      postgres 30358  0.2  2.7 54222604 3567260 ?    Ss   17:07   0:10  \_ postgres: cluster99: checkpointer process
      postgres 30360  0.0  0.0 149556  5232 ?        Ss   17:07   0:01  \_ postgres: cluster99: stats collector process
      postgres 38771  0.0  0.0 149284  5100 ?        Ss   17:16   0:00  \_ postgres: cluster99: archiver process   last was 0000008900000300000000B1
      postgres 38830  0.6  0.0 54221832 12500 ?      Ss   17:16   0:25  \_ postgres: cluster99: wal sender process repl <redacted>(21456) streaming 300/B2E9F418
      postgres 38990  0.6  0.0 54221728 12444 ?      Ss   17:16   0:24  \_ postgres: cluster99: wal sender process repl <redacted>(20956) streaming 300/B2E9F418
      postgres 46186 33.5  0.0 54225416 21360 ?      Rs   18:18   0:45  \_ postgres: cluster99: wal sender process debezium <redacted>(22660) idle
      

      When we kill the Debezium walsender backend process from the operating system, the instance is able to shut down and the failover completes successfully.
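
      The same effect can be achieved from SQL instead of from the operating system. A sketch, assuming the debezium_test slot name shown further below:

```sql
-- Terminate the walsender that holds the Debezium replication slot,
-- so that the "fast" shutdown can complete.
-- active_pid is the PID of the backend currently using the slot (9.5+).
SELECT pg_terminate_backend(active_pid)
FROM pg_replication_slots
WHERE slot_name = 'debezium_test' AND active;
```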

      Heartbeat is enabled:

      {
        "heartbeat.interval.ms": "5000",
        "heartbeat.action.query": "update debezium set ts=now();"
      }
      

      The query is executed regularly:

      test=# select * from debezium ;
                    ts
      -------------------------------
       2020-09-25 18:18:49.719774+02
      (1 row)
      
      test=# select * from debezium ;
                    ts
      -------------------------------
       2020-09-25 18:18:54.782727+02
      (1 row)
      

      But the lag never goes down to zero:

      postgres=# select slot_name, pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)) as restart_lsn_lag, pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), confirmed_flush_lsn)) as confirmed_flush_lsn_lag, restart_lsn, confirmed_flush_lsn from pg_replication_slots;
                     slot_name                | restart_lsn_lag | confirmed_flush_lsn_lag | restart_lsn  | confirmed_flush_lsn
      ----------------------------------------+-----------------+-------------------------+--------------+---------------------
       patroni_<redacted>                     | 0 bytes         |                         | 300/B2E9EEE8 |
       patroni_<redacted>                     | 0 bytes         |                         | 300/B2E9EEE8 |
       debezium_test                          | 912 bytes       | 656 bytes               | 300/B2E9EB58 | 300/B2E9EC58
      (3 rows)
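
      For anyone reproducing this on PostgreSQL 10 or later: the pg_xlog_* functions used above were renamed, so the equivalent lag query becomes:

```sql
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS restart_lsn_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS confirmed_flush_lsn_lag,
       restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;
```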
      

      The shutdown mode used is "fast", which means PostgreSQL terminates all client connections without waiting for them to disconnect. That also kills the heartbeat connection, but not the WAL streaming connection.

      The bug happens on:

      • Patroni 1.6.5
      • PostgreSQL 9.6.19
      • Debezium 1.3.0.CR1
      • wal2json 2.3-1.pgdg90+1
      • Kafka 2.6.0
      • Debian 9.13

      This issue is similar to DBZ-1727 but it happens on a more recent version.

              People

              Assignee:
              rkerner René Kerner
              Reporter:
              jriou Julien Riou (Inactive)