DBZ-790: Assertion error in protobuf-c causes DB crash


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: 0.7.5
    • Component: postgres-decoderbufs

      Our Postgres DB is being crashed by an assertion error in a dependency of the decoderbufs plugin. After over a week of successful operation, the DB was forced into recovery mode and a restart by the plugin, and we are now unable to move past a certain point in the WAL without crashing the DB.

      If we try to peek at the data with:

      select data from pg_logical_slot_peek_binary_changes('our_slot_name',NULL,NULL);
      

      It causes the DB to crash.
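
      A possible way to narrow this down (a debugging sketch, not something we have run) is to peek one change at a time by passing an upto_nchanges limit as the third argument instead of NULL:

      -- Peek at most one pending change from the slot without consuming it.
      select lsn, xid, data
      from pg_logical_slot_peek_binary_changes('our_slot_name', NULL, 1);
      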

      DB log during crash:

      2018-07-25 12:24:39 BST LOG:  starting logical decoding for slot "our_slot_name"
      2018-07-25 12:24:39 BST DETAIL:  streaming transactions committing after 137/82C1F4CB, reading WAL from 137/82B14718
      2018-07-25 12:24:39 BST STATEMENT:  select data from pg_logical_slot_peek_binary_changes('our_slot_name',NULL,NULL);
      2018-07-25 12:24:39 BST LOG:  logical decoding found consistent point at 137/82B14718
      2018-07-25 12:24:39 BST DETAIL:  Logical decoding will begin using saved snapshot.
      2018-07-25 12:24:39 BST STATEMENT:  select data from pg_logical_slot_peek_binary_changes('our_slot_name',NULL,NULL);
      ocal] SELECT: protobuf-c/protobuf-c.c:643: protobuf_c_message_get_packed_size: Assertion `((message)->descriptor)->magic == 0x28aaeef9' failed.
      2018-07-25 12:24:39 BST LOG:  server process (PID 15085) was terminated by signal 6: Aborted
      2018-07-25 12:24:39 BST DETAIL:  Failed process was running: select data from pg_logical_slot_peek_binary_changes('our_slot_name',NULL,NULL);
      2018-07-25 12:24:39 BST LOG:  terminating any other active server processes
      2018-07-25 12:24:39 BST WARNING:  terminating connection because of crash of another server process
      2018-07-25 12:24:39 BST DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
      

      Each crash starts with the same assertion failure in protobuf-c:

      protobuf-c/protobuf-c.c:643: protobuf_c_message_get_packed_size: Assertion `((message)->descriptor)->magic == 0x28aaeef9' failed.
      

      The slot is now stuck at its restart_lsn, because the decoder cannot continue reading the WAL without crashing the DB each time.
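
      The stuck position (and the WAL now being retained behind it) can be checked from pg_replication_slots; the function names below are the pre-10 spellings, assuming a 9.x server as the pg_xlog/ directory name suggests:

      -- Confirm the slot is pinned at 137/82B14718 and how much WAL it is holding back.
      select slot_name,
             restart_lsn,
             pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) as retained_wal_bytes
      from pg_replication_slots
      where slot_name = 'our_slot_name';
      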

      Reading the WAL from the restart_lsn of the replication slot with:

      $ pg_xlogdump -p pg_xlog/ --start 137/82B14718 --limit=10
      

      Gives us:

      rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 137/82B14718, prev 137/82B14648, desc: RUNNING_XACTS nextXid 105761346 latestCompletedXid 105761344 oldestRunningXid 105761345; 1 xacts: 105761345
      rmgr: Transaction len (rec/tot):     46/    46, tx:  105761345, lsn: 137/82B14750, prev 137/82B14718, desc: COMMIT 2018-07-24 17:01:40.610742 BST
      rmgr: Heap        len (rec/tot):    157/   157, tx:  105761346, lsn: 137/82B14780, prev 137/82B14750, desc: INSERT off 54, blkref #0: rel 1663/4842851/4846532 blk 316517
      rmgr: Btree       len (rec/tot):     64/    64, tx:  105761346, lsn: 137/82B14820, prev 137/82B14780, desc: INSERT_LEAF off 43, blkref #0: rel 1663/4842851/4847601 blk 146583
      rmgr: Btree       len (rec/tot):     64/    64, tx:  105761346, lsn: 137/82B14860, prev 137/82B14820, desc: INSERT_LEAF off 2, blkref #0: rel 1663/4842851/4848028 blk 1
      rmgr: Gist        len (rec/tot):     66/    66, tx:  105761346, lsn: 137/82B148A0, prev 137/82B14860, desc: PAGE_UPDATE , blkref #0: rel 1663/4842851/4848029 blk 87494
      rmgr: Btree       len (rec/tot):     64/    64, tx:  105761346, lsn: 137/82B148E8, prev 137/82B148A0, desc: INSERT_LEAF off 2, blkref #0: rel 1663/4842851/4848030 blk 175834
      rmgr: Btree       len (rec/tot):     72/    72, tx:  105761346, lsn: 137/82B14928, prev 137/82B148E8, desc: INSERT_LEAF off 175, blkref #0: rel 1663/4842851/4848031 blk 367635
      rmgr: Btree       len (rec/tot):     72/    72, tx:  105761346, lsn: 137/82B14970, prev 137/82B14928, desc: INSERT_LEAF off 175, blkref #0: rel 1663/4842851/4848032 blk 367520
      rmgr: Transaction len (rec/tot):     46/    46, tx:  105761346, lsn: 137/82B149B8, prev 137/82B14970, desc: COMMIT 2018-07-24 17:01:40.613454 BST
      

      The first record at the restart LSN 137/82B14718 is a RUNNING_XACTS standby record, but I'm not sure whether that record itself is the actual cause of the crash.
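
      One way to test that (again only a sketch, untried on this system) might be to bound the peek with the upto_lsn argument, starting from the confirmed point reported in the log (137/82C1F4CB) and moving it forward in steps; the first upto_lsn value that crashes should bracket the transaction whose decode trips the assert:

      -- Decode no further than the already-confirmed point; if this returns
      -- cleanly, the failing record lies in a later transaction.
      select lsn, xid, data
      from pg_logical_slot_peek_binary_changes('our_slot_name', '137/82C1F4CB', NULL);
      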

            Assignee: Unassigned
            Reporter: Julius Syvanen