Infinispan / ISPN-14987

SIGSEGV and memory corruption with persistent cache backed by RocksDB


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Fix Version/s: 14.0.13.Final, 15.0.0.Final
    • Affects Version/s: 14.0.1.Final, 14.0.11.Final

      We've been running some tests with a 12-node Infinispan cluster and hit an issue that causes the nodes of the cluster to crash on startup or when queried.

      I've tried to simplify the issue as much as possible and arrived at something manageable. The setup is fairly simple:

      • Start with a single Infinispan node with a single persistent cache backed by RocksDB
      • Fill it with tens of thousands of entries
      • Restart the node

      At this point Infinispan will sometimes crash on startup. The probability of that happening seems to depend on the persisted data: I've seen the crash happen almost every time with some data sets and only occasionally with others.

      However, if it doesn't crash on startup, then it will probably crash when queried over the REST API like this:

      http://localhost:11222/rest/v2/caches/<cache>?action=keys&limit=100&batch=512

      Though sometimes it needs more than one attempt.

      Note that it doesn't seem to crash when batch is lower than 512. It also doesn't seem to crash if I omit the limit parameter.

      I'll now provide some more details on the individual steps.
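
      To compare batch sizes more systematically, a small probe script can repeat the keys query several times per batch value. This is only a sketch: the host, port, and endpoint come from the URL above, while the function names, the three-attempt count, and the example batch sizes are assumptions made for illustration. It sends no credentials, like the curl command used for inserting entries.

```shell
#!/bin/sh
# Sketch: probe the keys endpoint with several batch sizes.
# Host, port, and path come from the report; the rest is illustrative.

# keys_url CACHE LIMIT BATCH: print the keys-listing URL.
keys_url() {
  printf 'http://localhost:11222/rest/v2/caches/%s?action=keys&limit=%s&batch=%s\n' "$1" "$2" "$3"
}

# probe_batches CACHE BATCH...: request each batch size three times with
# limit=100 and report every request that curl could not complete
# (for example because the server process died mid-request).
probe_batches() {
  cache="$1"
  shift
  for batch in "$@"; do
    attempt=1
    while [ "$attempt" -le 3 ]; do
      if curl -sf -o /dev/null "$(keys_url "$cache" 100 "$batch")"; then
        echo "batch=$batch attempt=$attempt ok"
      else
        echo "batch=$batch attempt=$attempt FAILED"
      fi
      attempt=$((attempt + 1))
    done
  done
}

# Example (hypothetical batch sizes around the reported threshold):
# probe_batches PersistedGrant 128 256 511 512
```

      A FAILED line only says that the request did not complete; the server output still has to be checked to confirm that the node actually crashed.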

      Single node backed by RocksDB

      In my tests I've been running Infinispan through Docker, mostly like this:

      docker run -it -e USER=admin -e PASS=admin -p 11222:11222 -v /.../myconfig:/user-config -v /.../rocksdbjni-7.1.2.jar:/opt/infinispan/lib/rocksdbjni-7.1.2.jar infinispan/server:14.0.11.Final -c /user-config/conf.xml
      

      The /.../myconfig path contains the conf.xml file that I've attached. rocksdbjni-7.1.2.jar is simply the RocksDB fat jar.

      The entries

      I am not entirely sure whether anything about the entries matters, so I'm just going with what I know works: the key is 44 random bytes encoded as a hex string (I generate this with openssl rand -hex 44 | tr a-z A-Z) and the value is JSON like this:

      {"Key":"fg+Dqc/jtQdS2Wpk9Pa5XeJD9WbjfW6cpxkNlUPDo9w=","Type":"refresh_token","SubjectId":"58014","SessionId":null,"ClientId":"spooler","Description":null,"CreationTime":"2023-06-22T12:24:34Z","Expiration":"2023-07-22T12:24:34Z","ConsumedTime":null,"Data":"{\"CreationTime\":\"2023-06-22T12:24:34Z\",\"Lifetime\":2592000,\"ConsumedTime\":null,\"AccessToken\":{\"AllowedSigningAlgorithms\":[],\"Confirmation\":null,\"Audiences\":[\"jobServiceApiAudience\"],\"Issuer\":\"http://localhost:5001\",\"CreationTime\":\"2023-06-22T12:24:34Z\",\"Lifetime\":2592000,\"Type\":\"access_token\",\"ClientId\":\"spooler\",\"AccessTokenType\":0,\"Description\":null,\"Claims\":[{\"Type\":\"client_id\",\"Value\":\"spooler\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"scope\",\"Value\":\"commands:create\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"scope\",\"Value\":\"jobs:create\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"scope\",\"Value\":\"jobs:delete\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"scope\",\"Value\":\"jobs:read\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"scope\",\"Value\":\"openid\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"scope\",\"Value\":\"print_session:create\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"scope\",\"Value\":\"profile\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"scope\",\"Value\":\"offline_access\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"sub\",\"Value\":\"67021\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"auth_time\",\"Value\":\"6554245419\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#integer64\"},{\"Type\":\"idp\",\"Value\":\"local\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"amr\",\"Value\":\"password\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"name\",\"Value\":\"Name89175 Surname65047\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"given_name\",\"Value\":\"Name99831\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"family_name\",\"Value\":\"Surname49224\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"username\",\"Value\":\"user21440\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"preferred_username\",\"Value\":\"user21440\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"UserId\",\"Value\":\"55734\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"jti\",\"Value\":\"0c7b5967dc98481d4cf742a9fb7d9b88\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#string\"},{\"Type\":\"iat\",\"Value\":\"6687422175\",\"ValueType\":\"http://www.w3.org/2001/XMLSchema#integer64\"}]}}"}
      

      I generate random variations of this value using a simple fish script and jq (I've attached both). I then insert a bunch of these into Infinispan by running this in a loop:

      ./make-value.fish | curl http://localhost:11222/rest/v2/caches/PersistedGrant/(openssl rand -hex 44 | tr a-z A-Z) -X POST --data @- -Hcontent-type:application/json
      

      I don't know whether the number of entries matters, but I've tried replicating the issue with different numbers of entries, from about 30k to about 100k, and I've been successful every time.
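
      For readers who don't use fish, the insert loop could look roughly like this in POSIX sh. It is a sketch under a few assumptions: the function names and the entry count argument are made up for illustration, the /dev/urandom fallback is not part of the original procedure, and make-value.fish is the attached script, expected in the current directory.

```shell
#!/bin/sh
# Sketch: fill the PersistedGrant cache with generated entries.

# random_key: print 44 random bytes as 88 uppercase hex characters.
random_key() {
  if command -v openssl >/dev/null 2>&1; then
    # Same command as in the report.
    openssl rand -hex 44 | tr a-z A-Z
  else
    # Fallback when openssl is missing (an assumption, not from the report).
    head -c 44 /dev/urandom | od -An -v -tx1 | tr -d ' \n' | tr a-z A-Z
    echo
  fi
}

# fill_cache COUNT: POST COUNT generated values, each under a fresh random key.
fill_cache() {
  count="$1"
  i=0
  while [ "$i" -lt "$count" ]; do
    ./make-value.fish | curl -s -o /dev/null -X POST --data @- \
      -H 'content-type:application/json' \
      "http://localhost:11222/rest/v2/caches/PersistedGrant/$(random_key)"
    i=$((i + 1))
  done
}

# Example: insert roughly the smallest number of entries that reproduced the issue.
# fill_cache 30000
```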

      The crashes

      It seems that Infinispan is crashing somewhere within RocksDB. It crashes in a few different ways. So far I've seen:

      • SIGSEGV
      • double free or corruption (!prev) - I believe this one comes from glibc's free
      • corrupted size vs. prev_size - also from glibc, in malloc or free
      • pure virtual method called - the C++ runtime complaining about a pure virtual function being called (which I think can happen in a constructor or destructor)

      I'm not sure whether there are others that I've missed. I've attached a couple of Infinispan outputs and crash reports that show these crashes.
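
      When collecting many runs, it can help to tally which of the signatures listed above each captured output shows. A minimal sketch, assuming each run's server output or crash report is saved to its own text file; the function names and the file layout in the example are hypothetical, and only the four message strings come from the list above.

```shell
#!/bin/sh
# Sketch: classify saved server outputs by crash signature.

# crash_signature FILE: print the first known signature found in FILE,
# or "unknown" if none of the four messages appears.
crash_signature() {
  if grep -qF 'SIGSEGV' "$1"; then
    echo 'SIGSEGV'
  elif grep -qF 'double free or corruption' "$1"; then
    echo 'double free or corruption'
  elif grep -qF 'corrupted size vs. prev_size' "$1"; then
    echo 'corrupted size vs. prev_size'
  elif grep -qF 'pure virtual method called' "$1"; then
    echo 'pure virtual method called'
  else
    echo 'unknown'
  fi
}

# tally_crashes FILE...: count how many files fall under each signature.
tally_crashes() {
  for f in "$@"; do
    crash_signature "$f"
  done | sort | uniq -c
}

# Example (hypothetical directory of captured outputs):
# tally_crashes runs/*.log
```

      A file that contains more than one message is counted under the first match only, so the order of the checks matters.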

      Other notes

      I've tried writing simple C++ and Java programs that iterate through the RocksDB database to see whether that triggers any issues (perhaps the database is somehow corrupted), but they ran without problems.

      I've tried doing the same thing with the latest rocksdbjni and I observed the same behavior.

      This might be loosely related to ISPN-12997 and ISPN-13008, though those were two major versions ago.

      So, any ideas what I could do to debug this further? Any ideas what could be the issue and how to mitigate it?

      Attachments

        1. conf.xml (2 kB)
        2. crashes.tar.gz (95 kB)
        3. make-value.fish (0.8 kB)
        4. make-value.jq (4 kB)

              rh-ee-jbolina Jose Bolina
              zdenek.biberle Zdeněk Biberle (Inactive)