- Bug
- Resolution: Unresolved
- Critical
- None
- 0.7.5
- None
Our Postgres DB is being crashed by an assertion error in a dependency of the decoderbufs plugin. After over a week of successful operation, our DB was forced into recovery mode and a restart because of the plugin. We are now unable to move past a certain point in the WAL without crashing the DB.
If we try to peek at the data with:
select data from pg_logical_slot_peek_binary_changes('our_slot_name',NULL,NULL);
the DB crashes.
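Limiting the peek to a single change (the third argument is upto_nchanges) would show whether the first pending change alone triggers the crash; a diagnostic sketch, not verified:
select data from pg_logical_slot_peek_binary_changes('our_slot_name', NULL, 1);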
DB log during crash:
2018-07-25 12:24:39 BST LOG: starting logical decoding for slot "our_slot_name"
2018-07-25 12:24:39 BST DETAIL: streaming transactions committing after 137/82C1F4CB, reading WAL from 137/82B14718
2018-07-25 12:24:39 BST STATEMENT: select data from pg_logical_slot_peek_binary_changes('our_slot_name',NULL,NULL);
2018-07-25 12:24:39 BST LOG: logical decoding found consistent point at 137/82B14718
2018-07-25 12:24:39 BST DETAIL: Logical decoding will begin using saved snapshot.
2018-07-25 12:24:39 BST STATEMENT: select data from pg_logical_slot_peek_binary_changes('our_slot_name',NULL,NULL);
[local] SELECT: protobuf-c/protobuf-c.c:643: protobuf_c_message_get_packed_size: Assertion `((message)->descriptor)->magic == 0x28aaeef9' failed.
2018-07-25 12:24:39 BST LOG: server process (PID 15085) was terminated by signal 6: Aborted
2018-07-25 12:24:39 BST DETAIL: Failed process was running: select data from pg_logical_slot_peek_binary_changes('our_slot_name',NULL,NULL);
2018-07-25 12:24:39 BST LOG: terminating any other active server processes
2018-07-25 12:24:39 BST WARNING: terminating connection because of crash of another server process
2018-07-25 12:24:39 BST DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
The crash starts with:
protobuf-c/protobuf-c.c:643: protobuf_c_message_get_packed_size: Assertion `((message)->descriptor)->magic == 0x28aaeef9' failed.
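That magic number is protobuf-c's PROTOBUF_C__MESSAGE_DESCRIPTOR_MAGIC, so the assertion means the ProtobufCMessage handed to protobuf_c_message_get_packed_size() had a missing or clobbered descriptor pointer (e.g. a struct that was never initialised with its generated __init function, or memory that was overwritten). A minimal illustrative sketch that trips the same assertion; this is not the decoderbufs code path, just a demonstration of what the check guards:

/* repro_assert.c -- illustrative only; compile with: cc repro_assert.c -lprotobuf-c */
#include <string.h>
#include <protobuf-c/protobuf-c.h>

int main(void)
{
    /* A descriptor that was never set up: magic stays 0 instead of 0x28aaeef9. */
    ProtobufCMessageDescriptor bogus;
    memset(&bogus, 0, sizeof bogus);

    /* A message pointing at that bogus descriptor, as if it were never
     * initialised by its generated __init function. */
    ProtobufCMessage msg;
    memset(&msg, 0, sizeof msg);
    msg.descriptor = &bogus;

    /* Aborts with: protobuf_c_message_get_packed_size:
     * Assertion `((message)->descriptor)->magic == 0x28aaeef9' failed. */
    protobuf_c_message_get_packed_size(&msg);
    return 0;
}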
The slot is now stuck on the restart_lsn because the decoder can't continue reading the WAL without crashing the DB each time.
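The restart_lsn referenced below can be read from the standard pg_replication_slots view:
select slot_name, plugin, restart_lsn
from pg_replication_slots
where slot_name = 'our_slot_name';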
Reading the WAL from the restart_lsn of the replication slot with:
$ pg_xlogdump -p pg_xlog/ --start 137/82B14718 --limit=10
Gives us:
rmgr: Standby len (rec/tot): 54/ 54, tx: 0, lsn: 137/82B14718, prev 137/82B14648, desc: RUNNING_XACTS nextXid 105761346 latestCompletedXid 105761344 oldestRunningXid 105761345; 1 xacts: 105761345
rmgr: Transaction len (rec/tot): 46/ 46, tx: 105761345, lsn: 137/82B14750, prev 137/82B14718, desc: COMMIT 2018-07-24 17:01:40.610742 BST
rmgr: Heap len (rec/tot): 157/ 157, tx: 105761346, lsn: 137/82B14780, prev 137/82B14750, desc: INSERT off 54, blkref #0: rel 1663/4842851/4846532 blk 316517
rmgr: Btree len (rec/tot): 64/ 64, tx: 105761346, lsn: 137/82B14820, prev 137/82B14780, desc: INSERT_LEAF off 43, blkref #0: rel 1663/4842851/4847601 blk 146583
rmgr: Btree len (rec/tot): 64/ 64, tx: 105761346, lsn: 137/82B14860, prev 137/82B14820, desc: INSERT_LEAF off 2, blkref #0: rel 1663/4842851/4848028 blk 1
rmgr: Gist len (rec/tot): 66/ 66, tx: 105761346, lsn: 137/82B148A0, prev 137/82B14860, desc: PAGE_UPDATE , blkref #0: rel 1663/4842851/4848029 blk 87494
rmgr: Btree len (rec/tot): 64/ 64, tx: 105761346, lsn: 137/82B148E8, prev 137/82B148A0, desc: INSERT_LEAF off 2, blkref #0: rel 1663/4842851/4848030 blk 175834
rmgr: Btree len (rec/tot): 72/ 72, tx: 105761346, lsn: 137/82B14928, prev 137/82B148E8, desc: INSERT_LEAF off 175, blkref #0: rel 1663/4842851/4848031 blk 367635
rmgr: Btree len (rec/tot): 72/ 72, tx: 105761346, lsn: 137/82B14970, prev 137/82B14928, desc: INSERT_LEAF off 175, blkref #0: rel 1663/4842851/4848032 blk 367520
rmgr: Transaction len (rec/tot): 46/ 46, tx: 105761346, lsn: 137/82B149B8, prev 137/82B14970, desc: COMMIT 2018-07-24 17:01:40.613454 BST
The first record, at the restart LSN 137/82B14718, is a RUNNING_XACTS record, but I'm not sure whether that is the actual cause of the crash.
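One more data point that may help triage: as far as I understand, decoderbufs only builds protobuf messages for actual row changes (in its change callback), and RUNNING_XACTS records are consumed internally by the snapshot builder, so the crash is more likely happening while packing the decoded INSERT from tx 105761346 than on the RUNNING_XACTS record itself. That transaction's records can be isolated with pg_xlogdump's XID filter:
$ pg_xlogdump -p pg_xlog/ --start 137/82B14718 --xid=105761346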