Uploaded image for project: 'Infinispan'
  1. Infinispan
  2. ISPN-13008

Infinispan crashes when loading state from RocksDB store

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Unresolved
    • Icon: Major Major
    • None
    • 12.0.1.Final, 12.1.2.Final
    • None
    • None

      We're using an Infinispan cluster with 3 nodes (2 cache owners) to back our Keycloak instances. In addition, RocksDB is used to persist the cache content for disaster recovery.

      During a rollout of a simple configuration change of Infinispan we ran into issues when Infinispan tried to load the state during startup from a RocksDB cache store. The caches contain roughly 200k entries in total.

      The cluster was in a healthy state, node 1 was stopped gracefully, rebalance between node 2 and 3 worked as expected, node 1 was started again. When loading the cache from the store, a core dump happened:

      May  4 07:29:33 sandbox-ispn-1 bash[265898]: # A fatal error has been detected by the Java Runtime Environment:
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: #
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: #  SIGSEGV (0xb) at pc=0x00007fea6534c140, pid=265898, tid=266045
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: #
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: # JRE version: OpenJDK Runtime Environment (11.0.11+9) (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: # Java VM: OpenJDK 64-Bit Server VM (11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: # Problematic frame:
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: # C  [libc.so.6+0x1ec140]
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: #
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to //core.265898)
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: #
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: # An error report file with more information is saved as:
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: # /tmp/hs_err_pid265898.log
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: Compiled method (nm)   25011 3809     n 0       org.rocksdb.RocksIterator::next0 (native)
      May  4 07:29:33 sandbox-ispn-1 bash[265898]:  total in heap  [0x00007fea53ea0a90,0x00007fea53ea0e08] = 888
      May  4 07:29:33 sandbox-ispn-1 bash[265898]:  relocation     [0x00007fea53ea0c08,0x00007fea53ea0c38] = 48
      May  4 07:29:33 sandbox-ispn-1 bash[265898]:  main code      [0x00007fea53ea0c40,0x00007fea53ea0e00] = 448
      May  4 07:29:33 sandbox-ispn-1 bash[265898]:  oops           [0x00007fea53ea0e00,0x00007fea53ea0e08] = 8
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: [thread 265899 also had an error]
      May  4 07:29:33 sandbox-ispn-1 bash[265898]: Compiled method (nm)   25020 3809     n 0       org.rocksdb.RocksIterator::next0 (native)
      May  4 07:29:33 sandbox-ispn-1 bash[265898]:  total in heap  [0x00007fea53ea0a90,0x00007fea53ea0e08] = 888
      May  4 07:29:33 sandbox-ispn-1 bash[265898]:  relocation     [0x00007fea53ea0c08,0x00007fea53ea0c38] = 48
      May  4 07:29:33 sandbox-ispn-1 bash[265898]:  main code      [0x00007fea53ea0c40,0x00007fea53ea0e00] = 448
      May  4 07:29:33 sandbox-ispn-1 bash[265898]:  oops           [0x00007fea53ea0e00,0x00007fea53ea0e08] = 8
      

      Second try ended with a different core dump

      May  4 07:33:58 sandbox-ispn-1 bash[266738]: # A fatal error has been detected by the Java Runtime Environment:
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: #
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: #  SIGSEGV (0xb) at pc=0x00007f3921698cd9, pid=266738, tid=266807
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: #
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: # JRE version: OpenJDK Runtime Environment (11.0.11+9) (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: # Java VM: OpenJDK 64-Bit Server VM (11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: # Problematic frame:
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: # C  [librocksdbjni2308608421276602065.so+0x4e2cd9]  rocksdb::BlockBasedTableIterator::NextAndGetResult(rocksdb::IterateResult*)+0x19
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: #
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to //core.266738)
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: #
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: # An error report file with more information is saved as:
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: # /tmp/hs_err_pid266738.log
      May  4 07:33:58 sandbox-ispn-1 bash[266738]: Compiled method (c2)   21059 4266       4       org.infinispan.protostream.impl.RawProtobufMarshallerDelegate::marshall (17 bytes)
      May  4 07:33:58 sandbox-ispn-1 bash[266738]:  total in heap  [0x00007f3939922790,0x00007f3939923130] = 2464
      May  4 07:33:58 sandbox-ispn-1 bash[266738]:  relocation     [0x00007f3939922908,0x00007f3939922948] = 64
      May  4 07:33:58 sandbox-ispn-1 bash[266738]:  main code      [0x00007f3939922960,0x00007f3939922ce0] = 896
      May  4 07:33:58 sandbox-ispn-1 bash[266738]:  stub code      [0x00007f3939922ce0,0x00007f3939922cf8] = 24
      May  4 07:33:58 sandbox-ispn-1 bash[266738]:  oops           [0x00007f3939922cf8,0x00007f3939922d08] = 16
      May  4 07:33:58 sandbox-ispn-1 bash[266738]:  metadata       [0x00007f3939922d08,0x00007f3939922da8] = 160
      May  4 07:33:58 sandbox-ispn-1 bash[266738]:  scopes data    [0x00007f3939922da8,0x00007f3939922fd0] = 552
      May  4 07:33:58 sandbox-ispn-1 bash[266738]:  scopes pcs     [0x00007f3939922fd0,0x00007f39399230f0] = 288
      May  4 07:33:58 sandbox-ispn-1 bash[266738]:  dependencies   [0x00007f39399230f0,0x00007f39399230f8] = 8
      May  4 07:33:58 sandbox-ispn-1 bash[266738]:  nul chk table  [0x00007f39399230f8,0x00007f3939923130] = 56
      

      As the node was unable to load the data from the RocksDB store we decided to purge the RocksDB directory and let the node fetch the state from the other two nodes. This worked as expected.

      Then we continued with the procedure on node 2. stopped gracefully, rebalance finished, started again and crashed:

      May 04 07:43:50 sandbox-ispn-2 bash[281753]: #
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: # A fatal error has been detected by the Java Runtime Environment:
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: #
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: #  SIGSEGV (0xb) at pc=0x000055c73d9bb000, pid=281753, tid=281826
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: #
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: # JRE version: OpenJDK Runtime Environment (11.0.11+9) (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: # Java VM: OpenJDK 64-Bit Server VM (11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: # Problematic frame:
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: # C  0x000055c73d9bb000
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: #
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to //core.281753)
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: #
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: # An error report file with more information is saved as:
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: # /tmp/hs_err_pid281753.log
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: Compiled method (c2)   24823 4411       4       org.infinispan.protostream.impl.RawProtobufMarshallerDelegate::marshall (17 bytes)
      May 04 07:43:50 sandbox-ispn-2 bash[281753]:  total in heap  [0x00007f387b45ac90,0x00007f387b45b630] = 2464
      May 04 07:43:50 sandbox-ispn-2 bash[281753]:  relocation     [0x00007f387b45ae08,0x00007f387b45ae48] = 64
      May 04 07:43:50 sandbox-ispn-2 bash[281753]:  main code      [0x00007f387b45ae60,0x00007f387b45b1e0] = 896
      May 04 07:43:50 sandbox-ispn-2 bash[281753]:  stub code      [0x00007f387b45b1e0,0x00007f387b45b1f8] = 24
      May 04 07:43:50 sandbox-ispn-2 bash[281753]:  oops           [0x00007f387b45b1f8,0x00007f387b45b208] = 16
      May 04 07:43:50 sandbox-ispn-2 bash[281753]:  metadata       [0x00007f387b45b208,0x00007f387b45b2a8] = 160
      May 04 07:43:50 sandbox-ispn-2 bash[281753]:  scopes data    [0x00007f387b45b2a8,0x00007f387b45b4d0] = 552
      May 04 07:43:50 sandbox-ispn-2 bash[281753]:  scopes pcs     [0x00007f387b45b4d0,0x00007f387b45b5f0] = 288
      May 04 07:43:50 sandbox-ispn-2 bash[281753]:  dependencies   [0x00007f387b45b5f0,0x00007f387b45b5f8] = 8
      May 04 07:43:50 sandbox-ispn-2 bash[281753]:  nul chk table  [0x00007f387b45b5f8,0x00007f387b45b630] = 56
      May 04 07:43:50 sandbox-ispn-2 bash[281753]: [thread 281840 also had an error]
      

      Purged the RocksDB, let it fetch state from the others and continued with node 3. Again, no luck.

      May 04 07:37:59 sandbox-ispn-3 bash[265627]: #
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: # A fatal error has been detected by the Java Runtime Environment:
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: #
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: #  SIGSEGV (0xb) at pc=0x0000000000000000, pid=265627, tid=265774
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: #
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: # JRE version: OpenJDK Runtime Environment (11.0.11+9) (build 11.0.11+9-Ubuntu-0ubuntu2.20.04)
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: # Java VM: OpenJDK 64-Bit Server VM (11.0.11+9-Ubuntu-0ubuntu2.20.04, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: # Problematic frame:
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: # C  0x0000000000000000
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: #
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: # Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport %p %s %c %d %P %E" (or dumping to //core.265627)
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: #
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: # An error report file with more information is saved as:
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: # /tmp/hs_err_pid265627.log
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: Compiled method (nm)   24948 3844     n 0       org.rocksdb.RocksIterator::next0 (native)
      May 04 07:37:59 sandbox-ispn-3 bash[265627]:  total in heap  [0x00007fb9ecbfcf90,0x00007fb9ecbfd308] = 888
      May 04 07:37:59 sandbox-ispn-3 bash[265627]:  relocation     [0x00007fb9ecbfd108,0x00007fb9ecbfd138] = 48
      May 04 07:37:59 sandbox-ispn-3 bash[265627]:  main code      [0x00007fb9ecbfd140,0x00007fb9ecbfd300] = 448
      May 04 07:37:59 sandbox-ispn-3 bash[265627]:  oops           [0x00007fb9ecbfd300,0x00007fb9ecbfd308] = 8
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: Compiled method (nm)   24948 3844     n 0       org.rocksdb.RocksIterator::next0 (native)
      May 04 07:37:59 sandbox-ispn-3 bash[265627]:  total in heap  [0x00007fb9ecbfcf90,0x00007fb9ecbfd308] = 888
      May 04 07:37:59 sandbox-ispn-3 bash[265627]:  relocation     [0x00007fb9ecbfd108,0x00007fb9ecbfd138] = 48
      May 04 07:37:59 sandbox-ispn-3 bash[265627]:  main code      [0x00007fb9ecbfd140,0x00007fb9ecbfd300] = 448
      May 04 07:37:59 sandbox-ispn-3 bash[265627]:  oops           [0x00007fb9ecbfd300,0x00007fb9ecbfd308] = 8
      May 04 07:37:59 sandbox-ispn-3 bash[265627]: [thread 265628 also had an error]
      

      In summary 3 out of 3 nodes have trouble loading the state from the RocksDB and crash. We've tried this a few more times and most of the time Ininispan crashes. It does work in some cases though.

      We noticed that the RocksDB version was updated in Infinispan 12.1.0 as part of https://issues.redhat.com/browse/ISPN-12815. We did not upgrade Infinispan but switched to the updated RocksDB version 6.15.5 as proposed in this ticket. Unfortunately this did not fix the issue.

      Please find attached our Infinispan config and the detailed error reports. Let me know in case more details are required to investigate the problem.

        1. infinispan.xml
          4 kB
        2. hs_err_pid265627.log
          407 kB
        3. hs_err_pid281753.log
          411 kB
        4. hs_err_pid266738.log
          412 kB
        5. hs_err_pid265898.log
          410 kB

              Unassigned Unassigned
              georgpace Georg F (Inactive)
              Votes:
              1 Vote for this issue
              Watchers:
              3 Start watching this issue

                Created:
                Updated: