- Bug
- Resolution: Done
- Major
- 1.3.0.CR1
- None
- False
- False
- Undefined
Patroni is a high-availability operator that controls the PostgreSQL process lifecycle (initdb, start, stop, promote) and manages replication. It constantly tries to acquire a lock in a distributed data store (DCS); when it fails to do so, another node acquires the lock elsewhere and is promoted.
When Patroni is used to manage PostgreSQL failovers and Debezium is connected, a forced switchover gets stuck with the instance blocked in the shutting-down state.
patroni-<redacted> ~ # patronictl switchover --force
Current cluster topology
+ Cluster: cluster99 (6571679703481550190) +-----------------+--------+---------+-----+-----------+---------------------+
| Member                                   | Host            | Role   | State   | TL  | Lag in MB | Tags                |
+------------------------------------------+-----------------+--------+---------+-----+-----------+---------------------+
| patroni-<redacted>                       | <redacted>      |        | running | 137 |         0 | clonefrom: true     |
|                                          |                 |        |         |     |           | nofailover: true    |
|                                          |                 |        |         |     |           | noloadbalance: true |
+------------------------------------------+-----------------+--------+---------+-----+-----------+---------------------+
| patroni-<redacted>                       | <redacted>      |        | running | 137 |         0 |                     |
+------------------------------------------+-----------------+--------+---------+-----+-----------+---------------------+
| patroni-<redacted>                       | <redacted>      | Leader | running | 137 |           |                     |
+------------------------------------------+-----------------+--------+---------+-----+-----------+---------------------+
2020-09-25 18:19:55,586 - WARNING - /usr/lib/python3/dist-packages/urllib3/connectionpool.py:845: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
  InsecureRequestWarning)
Switchover failed, details: 503, Switchover status unknown
No new connections are allowed. The following processes are still alive:
postgres 30055  0.0  0.8 54220116 1145376 ? S  17:07 0:01 /usr/lib/postgresql/9.6/bin/postgres -D /data/postgresql/main --config-file=/etc/postgresql/9.6/main/postgresql.conf --por
postgres 30358  0.2  2.7 54222604 3567260 ? Ss 17:07 0:10  \_ postgres: cluster99: checkpointer process
postgres 30360  0.0  0.0   149556    5232 ? Ss 17:07 0:01  \_ postgres: cluster99: stats collector process
postgres 38771  0.0  0.0   149284    5100 ? Ss 17:16 0:00  \_ postgres: cluster99: archiver process   last was 0000008900000300000000B1
postgres 38830  0.6  0.0 54221832   12500 ? Ss 17:16 0:25  \_ postgres: cluster99: wal sender process repl <redacted>(21456) streaming 300/B2E9F418
postgres 38990  0.6  0.0 54221728   12444 ? Ss 17:16 0:24  \_ postgres: cluster99: wal sender process repl <redacted>(20956) streaming 300/B2E9F418
postgres 46186 33.5  0.0 54225416   21360 ? Rs 18:18 0:45  \_ postgres: cluster99: wal sender process debezium <redacted>(22660) idle
When we kill the debezium backend process (the WAL sender) from the operating system, the instance is able to shut down and the failover completes successfully.
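For reference, a hypothetical SQL-level equivalent of that workaround (assuming, as in the process listing above, that the replication user is named debezium) would be:

-- Terminate the WAL sender owned by the debezium replication user,
-- instead of killing the backend process from the OS.
select pg_terminate_backend(pid)
  from pg_stat_replication
 where usename = 'debezium';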
Heartbeat is enabled:
{ "heartbeat.interval.ms": "5000", "heartbeat.action.query": "update debezium set ts=now();" }
The query is executed regularly:
test=# select * from debezium ;
              ts
-------------------------------
 2020-09-25 18:18:49.719774+02
(1 row)

test=# select * from debezium ;
              ts
-------------------------------
 2020-09-25 18:18:54.782727+02
(1 row)
But the lag never goes down to zero:
postgres=# select slot_name,
                  pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn)) as restart_lsn_lag,
                  pg_size_pretty(pg_xlog_location_diff(pg_current_xlog_location(), confirmed_flush_lsn)) as confirmed_flush_lsn_lag,
                  restart_lsn,
                  confirmed_flush_lsn
           from pg_replication_slots;
               slot_name                | restart_lsn_lag | confirmed_flush_lsn_lag | restart_lsn  | confirmed_flush_lsn
----------------------------------------+-----------------+-------------------------+--------------+---------------------
 patroni_<redacted>                     | 0 bytes         |                         | 300/B2E9EEE8 |
 patroni_<redacted>                     | 0 bytes         |                         | 300/B2E9EEE8 |
 debezium_test                          | 912 bytes       | 656 bytes               | 300/B2E9EB58 | 300/B2E9EC58
(3 rows)
The shutdown mode used is "fast", which means it terminates all client connections without waiting for them to disconnect. That also kills the heartbeat connection, but not the WAL streaming connection.
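For completeness, the streaming connections that survive the fast shutdown are the WAL senders; a sketch of a query to list them before the switchover, using the pg_stat_replication view (PostgreSQL 9.6 column names), would be:

-- List WAL sender sessions: the two Patroni replicas and the Debezium slot consumer.
-- flush_location is the 9.6 column name (renamed in later major versions).
select pid, usename, application_name, state, flush_location
  from pg_stat_replication;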
The bug happens on:
- Patroni 1.6.5
- PostgreSQL 9.6.19
- Debezium 1.3.0.CR1
- wal2json 2.3-1.pgdg90+1
- Kafka 2.6.0
- Debian 9.13
This issue is similar to DBZ-1727, but it happens on a more recent version.
Relates to:
- DBZ-2685 Support PostgreSQL connector retry when database is restarted (Closed)