  Debezium / DBZ-2617

Patroni can't stop PostgreSQL when Debezium is streaming


    Details

    • Steps to Reproduce:
      1. Declare the Debezium connector to Kafka Connect
      2. Wait for the snapshot to end
      3. Use pgbench to write into the database and trigger the heartbeats
      4. Write to other databases
      5. Wait at least the heartbeat interval to ensure the slot LSN has moved forward
      6. Perform a failover with patronictl or shut down the instance manually
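
      For reference, the connector from step 1 can be registered with a payload along these lines (a sketch; the hostname, credentials, and connector name are placeholders, not taken from this report — only the heartbeat settings and the debezium_test slot name match what is shown below):

```json
{
  "name": "debezium-test",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "wal2json",
    "database.hostname": "db.example.com",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "test",
    "database.server.name": "cluster99",
    "slot.name": "debezium_test",
    "heartbeat.interval.ms": "5000",
    "heartbeat.action.query": "update debezium set ts=now();"
  }
}
```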

      Description

      Patroni is a high-availability operator that controls the PostgreSQL process lifecycle (initdb, start, stop, promote) and manages replication. It continuously tries to hold a lock in a distributed configuration store (DCS); when it fails to do so, another node is promoted by acquiring the same lock elsewhere.

      When Patroni is used to manage PostgreSQL failovers and Debezium is connected, a forced switchover gets stuck in the shutting-down state.

      patroni-<redacted> ~ # patronictl switchover --force
      Current cluster topology
      + Cluster: cluster99 (6571679703481550190) ----------------+--------+---------+-----+-----------+---------------------+
      |                 Member                 |       Host      |  Role  |  State  |  TL | Lag in MB | Tags                |
      +----------------------------------------+-----------------+--------+---------+-----+-----------+---------------------+
      |           patroni-<redacted>           |    <redacted>   |        | running | 137 |         0 | clonefrom: true     |
      |                                        |                 |        |         |     |           | nofailover: true    |
      |                                        |                 |        |         |     |           | noloadbalance: true |
      +----------------------------------------+-----------------+--------+---------+-----+-----------+---------------------+
      |           patroni-<redacted>           |    <redacted>   |        | running | 137 |         0 |                     |
      +----------------------------------------+-----------------+--------+---------+-----+-----------+---------------------+
      |           patroni-<redacted>           |    <redacted>   | Leader | running | 137 |           |                     |
      +----------------------------------------+-----------------+--------+---------+-----+-----------+---------------------+
      2020-09-25 18:19:55,586 - WARNING - /usr/lib/python3/dist-packages/urllib3/connectionpool.py:845: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
        InsecureRequestWarning)
      
      Switchover failed, details: 503, Switchover status unknown
      

      No new connections are allowed. The following processes are still alive:

      postgres 30055  0.0  0.8 54220116 1145376 ?    S    17:07   0:01 /usr/lib/postgresql/9.6/bin/postgres -D /data/postgresql/main --config-file=/etc/postgresql/9.6/main/postgresql.conf --por
      postgres 30358  0.2  2.7 54222604 3567260 ?    Ss   17:07   0:10  \_ postgres: cluster99: checkpointer process
      postgres 30360  0.0  0.0 149556  5232 ?        Ss   17:07   0:01  \_ postgres: cluster99: stats collector process
      postgres 38771  0.0  0.0 149284  5100 ?        Ss   17:16   0:00  \_ postgres: cluster99: archiver process   last was 0000008900000300000000B1
      postgres 38830  0.6  0.0 54221832 12500 ?      Ss   17:16   0:25  \_ postgres: cluster99: wal sender process repl <redacted>(21456) streaming 300/B2E9F418
      postgres 38990  0.6  0.0 54221728 12444 ?      Ss   17:16   0:24  \_ postgres: cluster99: wal sender process repl <redacted>(20956) streaming 300/B2E9F418
      postgres 46186 33.5  0.0 54225416 21360 ?      Rs   18:18   0:45  \_ postgres: cluster99: wal sender process debezium <redacted>(22660) idle
      

      When we kill the Debezium walsender backend process from the operating system, the instance is able to shut down and the failover completes successfully.
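
      The same effect can be achieved from SQL instead of from the operating system. A sketch, assuming the debezium_test slot name shown further below:

```sql
-- Terminate the walsender that holds the Debezium replication slot,
-- so that the "fast" shutdown can complete.
-- active_pid is the PID of the backend currently using the slot (9.5+).
SELECT pg_terminate_backend(active_pid)
FROM pg_replication_slots
WHERE slot_name = 'debezium_test' AND active;
```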

      Heartbeat is enabled:

      {
        "heartbeat.interval.ms": "5000",
        "heartbeat.action.query": "update debezium set ts=now();"
      }
      

      The query is executed regularly:

      test=# select * from debezium ;
                    ts
      -------------------------------
       2020-09-25 18:18:49.719774+02
      (1 row)
      
      test=# select * from debezium ;
                    ts
      -------------------------------
       2020-09-25 18:18:54.782727+02
      (1 row)
      

      But the lag never goes down to zero:

      postgres=# select slot_name, pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)) as restart_lsn_lag, pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), confirmed_flush_lsn)) as confirmed_flush_lsn_lag, restart_lsn, confirmed_flush_lsn from pg_replication_slots;
                     slot_name                | restart_lsn_lag | confirmed_flush_lsn_lag | restart_lsn  | confirmed_flush_lsn
      ----------------------------------------+-----------------+-------------------------+--------------+---------------------
       patroni_<redacted>                     | 0 bytes         |                         | 300/B2E9EEE8 |
       patroni_<redacted>                     | 0 bytes         |                         | 300/B2E9EEE8 |
       debezium_test                          | 912 bytes       | 656 bytes               | 300/B2E9EB58 | 300/B2E9EC58
      (3 rows)
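
      For anyone reproducing this on PostgreSQL 10 or later: the pg_xlog_* functions used above were renamed, so the equivalent lag query becomes:

```sql
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS restart_lsn_lag,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS confirmed_flush_lsn_lag,
       restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots;
```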
      

      The shutdown mode used is "fast", which means PostgreSQL terminates all client connections without waiting for them to disconnect. That also kills the heartbeat connection, but not the WAL streaming connection.

      The bug happens on:

      • Patroni 1.6.5
      • PostgreSQL 9.6.19
      • Debezium 1.3.0.CR1
      • wal2json 2.3-1.pgdg90+1
      • Kafka 2.6.0
      • Debian 9.13

      This issue is similar to DBZ-1727 but it happens on a more recent version.

              People

              Assignee:
              rkerner René Kerner
              Reporter:
              jriou Julien Riou (Inactive)