DBZ-790: Assertion error in protobuf-c causes DB crash


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Affects Version: 0.7.5
    • Component: postgres-decoderbufs

      Our Postgres DB is being crashed by an assertion error in a dependency of the decoderbufs plugin. After over a week of successful operation, the DB was forced into recovery mode and a restart by the plugin, and we are now unable to move past a certain point in the WAL without crashing the DB.

      If we try to peek at the data with:

      select data from pg_logical_slot_peek_binary_changes('our_slot_name',NULL,NULL);
      

      It causes the DB to crash.
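
      A possible way to narrow this down (a debugging sketch, not something we have run) is to peek one change at a time by passing an upto_nchanges limit as the third argument instead of NULL:

      -- Peek at most one pending change from the slot without consuming it.
      select lsn, xid, data
      from pg_logical_slot_peek_binary_changes('our_slot_name', NULL, 1);
      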

      DB log during crash:

      2018-07-25 12:24:39 BST LOG:  starting logical decoding for slot "our_slot_name"
      2018-07-25 12:24:39 BST DETAIL:  streaming transactions committing after 137/82C1F4CB, reading WAL from 137/82B14718
      2018-07-25 12:24:39 BST STATEMENT:  select data from pg_logical_slot_peek_binary_changes('our_slot_name',NULL,NULL);
      2018-07-25 12:24:39 BST LOG:  logical decoding found consistent point at 137/82B14718
      2018-07-25 12:24:39 BST DETAIL:  Logical decoding will begin using saved snapshot.
      2018-07-25 12:24:39 BST STATEMENT:  select data from pg_logical_slot_peek_binary_changes('our_slot_name',NULL,NULL);
      ocal] SELECT: protobuf-c/protobuf-c.c:643: protobuf_c_message_get_packed_size: Assertion `((message)->descriptor)->magic == 0x28aaeef9' failed.
      2018-07-25 12:24:39 BST LOG:  server process (PID 15085) was terminated by signal 6: Aborted
      2018-07-25 12:24:39 BST DETAIL:  Failed process was running: select data from pg_logical_slot_peek_binary_changes('our_slot_name',NULL,NULL);
      2018-07-25 12:24:39 BST LOG:  terminating any other active server processes
      2018-07-25 12:24:39 BST WARNING:  terminating connection because of crash of another server process
      2018-07-25 12:24:39 BST DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
      

      Each crash starts with the same assertion failure in protobuf-c:

      protobuf-c/protobuf-c.c:643: protobuf_c_message_get_packed_size: Assertion `((message)->descriptor)->magic == 0x28aaeef9' failed.
      

      The slot is now stuck at its restart_lsn, because the decoder cannot continue reading the WAL without crashing the DB each time.
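
      The stuck position (and the WAL now being retained behind it) can be checked from pg_replication_slots; the function names below are the pre-10 spellings, assuming a 9.x server as the pg_xlog/ directory name suggests:

      -- Confirm the slot is pinned at 137/82B14718 and how much WAL it is holding back.
      select slot_name,
             restart_lsn,
             pg_xlog_location_diff(pg_current_xlog_location(), restart_lsn) as retained_wal_bytes
      from pg_replication_slots
      where slot_name = 'our_slot_name';
      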

      Reading the WAL from the restart_lsn of the replication slot with:

      $ pg_xlogdump -p pg_xlog/ --start 137/82B14718 --limit=10
      

      Gives us:

      rmgr: Standby     len (rec/tot):     54/    54, tx:          0, lsn: 137/82B14718, prev 137/82B14648, desc: RUNNING_XACTS nextXid 105761346 latestCompletedXid 105761344 oldestRunningXid 105761345; 1 xacts: 105761345
      rmgr: Transaction len (rec/tot):     46/    46, tx:  105761345, lsn: 137/82B14750, prev 137/82B14718, desc: COMMIT 2018-07-24 17:01:40.610742 BST
      rmgr: Heap        len (rec/tot):    157/   157, tx:  105761346, lsn: 137/82B14780, prev 137/82B14750, desc: INSERT off 54, blkref #0: rel 1663/4842851/4846532 blk 316517
      rmgr: Btree       len (rec/tot):     64/    64, tx:  105761346, lsn: 137/82B14820, prev 137/82B14780, desc: INSERT_LEAF off 43, blkref #0: rel 1663/4842851/4847601 blk 146583
      rmgr: Btree       len (rec/tot):     64/    64, tx:  105761346, lsn: 137/82B14860, prev 137/82B14820, desc: INSERT_LEAF off 2, blkref #0: rel 1663/4842851/4848028 blk 1
      rmgr: Gist        len (rec/tot):     66/    66, tx:  105761346, lsn: 137/82B148A0, prev 137/82B14860, desc: PAGE_UPDATE , blkref #0: rel 1663/4842851/4848029 blk 87494
      rmgr: Btree       len (rec/tot):     64/    64, tx:  105761346, lsn: 137/82B148E8, prev 137/82B148A0, desc: INSERT_LEAF off 2, blkref #0: rel 1663/4842851/4848030 blk 175834
      rmgr: Btree       len (rec/tot):     72/    72, tx:  105761346, lsn: 137/82B14928, prev 137/82B148E8, desc: INSERT_LEAF off 175, blkref #0: rel 1663/4842851/4848031 blk 367635
      rmgr: Btree       len (rec/tot):     72/    72, tx:  105761346, lsn: 137/82B14970, prev 137/82B14928, desc: INSERT_LEAF off 175, blkref #0: rel 1663/4842851/4848032 blk 367520
      rmgr: Transaction len (rec/tot):     46/    46, tx:  105761346, lsn: 137/82B149B8, prev 137/82B14970, desc: COMMIT 2018-07-24 17:01:40.613454 BST
      

      The first record at the restart LSN 137/82B14718 is a RUNNING_XACTS standby record, but I'm not sure whether that record itself is the actual cause of the crash.
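
      One way to test that (again only a sketch, untried on this system) might be to bound the peek with the upto_lsn argument, starting from the confirmed point reported in the log (137/82C1F4CB) and moving it forward in steps; the first upto_lsn value that crashes should bracket the transaction whose decode trips the assert:

      -- Decode no further than the already-confirmed point; if this returns
      -- cleanly, the failing record lies in a later transaction.
      select lsn, xid, data
      from pg_logical_slot_peek_binary_changes('our_slot_name', '137/82C1F4CB', NULL);
      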

            Assignee: Unassigned
            Reporter: Julius Syvanen