
Soft schema-based storage


    • Enhancement
    • Resolution: Won't Do
    • Critical
    • None
    • None
    • Core
    • None
    • Documentation (Ref Guide, User Guide, etc.), Release Notes, Interactive Demo/Tutorial, Compatibility/Configuration
    • High

      This JIRA is about storing metadata alongside values. Perhaps encapsulating values as SchematicValues, which could be described as:

        class SchematicValue {
          String jsonMetadata;
          String jsonObject;
        }
      

      Metadata would allow for a few interesting features:

      • Extraction of lifespan and timestamp data if manipulated over a remote protocol (REST, HotRod, etc.)
      • Content type for REST responses
      • Timestamps and SHA-1 hashes, useful for HTTP headers (e.g., ETag, Cache-Control, etc.)
      • Validation information (may not be processed by Infinispan, but can be used by client libs)
      • Classloader/marshaller/classdef version info
      • General structure of the information stored
      • Reference to the schema for this document
      • Storage of older versions
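To make the SHA-1/ETag item concrete, here is a minimal sketch of deriving an ETag value from a stored value's bytes. It is not part of Infinispan; the class name `EtagSketch` and the method `etagFor` are hypothetical, and the quoted-hex format is just one reasonable choice for a strong ETag.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Infinispan: derives an ETag value from stored content.
class EtagSketch {

    // Returns a quoted, lowercase hex SHA-1 digest of the given content bytes.
    static String etagFor(byte[] content) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder("\"");
            for (byte b : sha1.digest(content)) {
                hex.append(String.format("%02x", b));
            }
            return hex.append('"').toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] content = "{\"name\":\"example\"}".getBytes(StandardCharsets.UTF_8);
        System.out.println("ETag: " + etagFor(content));
    }
}
```

Storing such a hash in the metadata would let a REST endpoint answer conditional requests without re-reading or re-hashing the content.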

            [ISPN-1103] Soft schema-based storage

            Rejecting, since there is no reason for this anymore.

            Comment added by Randall Hauch (Inactive)

            After discussing this with Manik, we think it's best to push this to the 5.2 release. It simply won't be ready to include in the 5.1 release, and in the meantime it is living and getting thorough usage and testing in the ModeShape repository, where it is serving as the primary store for JCR content for ModeShape 3.0.

            Therefore, retargeting to the 5.2 release.

            Comment added by Randall Hauch (Inactive)

            It's just using Infinispan's Map-Reduce functionality to validate each document using the JSON Schema referenced by the document. It should distribute just fine, because validation depends only upon the document itself (the value in Infinispan) and the schema registry, which can be serialized.

            The SchematicDb.validateAll() method is the one that uses Infinispan Map-Reduce, with separate Mapper and Reducer implementations (all three were linked from the original comment).

            The Reducer spits out validation result objects, which contain the validation errors, warnings, and information messages. That means that the collector will have a Results object for each document key.
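The shape of that validation can be sketched in plain Java. This is only a toy stand-in: it does not use the Infinispan Map-Reduce API, the class and method names are hypothetical, and a "schema" is reduced to a list of required field names rather than a real JSON Schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for map-reduce validation; it does not use the Infinispan Map-Reduce API.
class ValidationSketch {

    // Map phase: validate one document using only the document and the schema registry.
    // A "schema" here is just a list of required field names (a deliberate simplification).
    static List<String> map(Map<String, Object> doc, Map<String, List<String>> registry) {
        List<String> problems = new ArrayList<>();
        Object ref = doc.get("schema-ref");
        List<String> required = (ref == null) ? null : registry.get(ref);
        if (required == null) {
            problems.add("unknown schema: " + ref);
            return problems;
        }
        for (String field : required) {
            if (!doc.containsKey(field)) {
                problems.add("missing field: " + field);
            }
        }
        return problems;
    }

    // Reduce phase: combine everything emitted for one key into a single result.
    static List<String> reduce(List<List<String>> emitted) {
        List<String> merged = new ArrayList<>();
        for (List<String> part : emitted) {
            merged.addAll(part);
        }
        return merged;
    }

    // Driver: produces one result per document key, mirroring the collector described above.
    static Map<String, List<String>> validateAll(Map<String, Map<String, Object>> docs,
                                                 Map<String, List<String>> registry) {
        Map<String, List<String>> results = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Object>> e : docs.entrySet()) {
            results.put(e.getKey(), reduce(List.of(map(e.getValue(), registry))));
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, List<String>> registry = Map.of("person", List.of("name", "age"));
        Map<String, Map<String, Object>> docs = new LinkedHashMap<>();
        docs.put("a", Map.of("schema-ref", "person", "name", "Ann", "age", 30));
        docs.put("b", Map.of("schema-ref", "person", "name", "Bob"));
        System.out.println(validateAll(docs, registry));
    }
}
```

Because the map step reads nothing beyond its own document and the registry, each node could run it over the entries it owns, which is the property that lets the work distribute.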

            Comment added by Randall Hauch (Inactive), edited

            What do you mean by map-reduce-based validation exactly? I guess this is not distributed at all but rather happening locally?

            Comment added by Galder Zamarreño

            The design has been evolving, and I've been pushing (overwriting) new versions of the branch. Here's a summary of the basic design:

            The primary goal is to enable storing dynamically-structured values with metadata, and to also enable describing the structure of each value (and metadata) using a schema-based approach. JSON documents provide an excellent way to offer structure that is extremely flexible, while JSON Schema offers a way to define the structure of JSON documents in a way that can be easily validated. (Note that a JSON Schema is just a JSON document that conforms to the JSON meta-schema, which is rich enough to be self-describing. It's actually a very nice specification.)

            Manik originally suggested storing the metadata and value (henceforth referred to as 'content') as strings, but doing so would mean that in order to access any information within the metadata or content, the JSON strings would first need to be parsed into an in-memory representation. Plus, if the content is to be modified, the JSON document would need to be modified and written as a string before being stored. This parsing and writing would become prohibitive.

            Since Infinispan is essentially a large heap of memory, it makes far more sense to represent the content and metadata as in-memory documents, as long as the in-memory representation is compatible with JSON, is easy to use, and can be validated using JSON Schemas. Additionally, if the representation also supported BSON data types (e.g., binary values, UUIDs, dates, regular expressions, etc.), more types of user content could be supported (including just raw binary data). These in-memory documents could at any time be read from or written to JSON or BSON formats. Having the schematic values be delta-aware with fine-grained locking (see ISPN-1115) would provide significant advantages with respect to performance and concurrency. (Note that efficient support for delta-aware means that the schematic value can capture the changes made to the documents by the client application and use those changes as the delta, rather than having to compare the changed document to a prior version to compute the changes.)
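The change-capture idea in the parenthetical note can be sketched as follows. This is a hypothetical illustration, not the schematic module's classes or Infinispan's DeltaAware interface: an editor records each edit as it is applied, and the recorded list is the delta, so no comparison against a prior version is needed.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of change capture; these are not the schematic module's classes.
class DeltaSketch {

    // One recorded change: a field set to a value, or removed when value is null.
    static final class Change {
        final String field;
        final Object value;

        Change(String field, Object value) {
            this.field = field;
            this.value = value;
        }
    }

    // Editor that applies edits to a document and remembers them as the delta.
    static final class RecordingEditor {
        private final Map<String, Object> doc;
        private final List<Change> delta = new ArrayList<>();

        RecordingEditor(Map<String, Object> doc) {
            this.doc = doc;
        }

        void set(String field, Object value) {
            doc.put(field, value);
            delta.add(new Change(field, value));
        }

        void remove(String field) {
            doc.remove(field);
            delta.add(new Change(field, null));
        }

        List<Change> delta() {
            return delta;
        }
    }

    // Replays a delta on another copy, e.g. a replica still holding the prior version.
    static void apply(List<Change> delta, Map<String, Object> target) {
        for (Change c : delta) {
            if (c.value == null) {
                target.remove(c.field);
            } else {
                target.put(c.field, c.value);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> original = new LinkedHashMap<>();
        original.put("title", "draft");
        Map<String, Object> replica = new LinkedHashMap<>(original);

        RecordingEditor editor = new RecordingEditor(original);
        editor.set("title", "final");
        editor.set("pages", 12);

        apply(editor.delta(), replica);
        System.out.println(replica.equals(original));
    }
}
```

Only the recorded changes would need to travel to other nodes, which is where the performance benefit described above would come from.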

            Using an in-memory representation also means that the content and metadata need not be stored as separate objects, but could instead be represented by a single document that is conceptually:

            {
               "metadata" : {
                  /* metadata as a nested document */
               },
               "content" : /* user's content, as a nested document or binary value */
            }
            

            This is the approach taken by the current design. The primary packages are:

            • org.infinispan.schematic
            • org.infinispan.schematic.document
            • org.infinispan.schematic.internal.*

            The first two packages contain the public API, whereas all implementation-specific classes are contained within the "internal" packages.

            The primary API interfaces are:

            • SchematicDb - similar to Cache but tailored to make it easy for users to store a content document (or binary value) with a metadata document. Each SchematicDb has a JSON Schema library and provides a map-reduce-based validation mechanism. Internally this uses a Cache<String,SchematicEntry>.
            • SchematicEntry - the value actually stored within Infinispan, which contains a content object (either a Document or a Binary value) and a metadata Document. There are methods for getting a mutable interface to the content and metadata documents. Since tracking the MIME type of the content is likely very common, the SchematicEntry interface provides methods for getting and setting the MIME type (which is actually stored in the metadata).
            • Document - an immutable interface to an in-memory document
            • EditableDocument - a mutable interface to an in-memory document
            • Json - utility class for parsing JSON formatted streams/files into Document instances, and for writing Document instances as JSON
            • Bson - utility class for parsing BSON formatted streams/files into Document instances, and for writing Document instances as BSON
            • JsonSchema - utility class for working with JSON Schemas
            • Various interfaces for representing JSON/BSON values: Array, Binary, Symbol, Timestamp, Code, CodeWithScope
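The split between a read-only view and an editable view, and an entry that keeps the MIME type inside its metadata, might look roughly like this. Every name and signature below is hypothetical (including the "contentType" field name); this is not the actual SchematicEntry, Document or EditableDocument API from the branch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical shapes only; the real interfaces in the schematic module may differ.
class EntrySketch {

    // Read-only view of an in-memory document.
    interface ReadableDoc {
        Object get(String field);
    }

    // Mutable view that adds a setter on top of the read-only view.
    interface EditableDoc extends ReadableDoc {
        void set(String field, Object value);
    }

    // Simple map-backed document implementing both views.
    static final class MapDoc implements EditableDoc {
        private final Map<String, Object> fields = new LinkedHashMap<>();

        public Object get(String field) {
            return fields.get(field);
        }

        public void set(String field, Object value) {
            fields.put(field, value);
        }
    }

    // One stored value: a single root document with "metadata" and "content" fields.
    static final class Entry {
        private final MapDoc root = new MapDoc();

        Entry(Object content, String mimeType) {
            MapDoc metadata = new MapDoc();
            metadata.set("contentType", mimeType); // field name is an assumption
            root.set("metadata", metadata);
            root.set("content", content);
        }

        ReadableDoc metadata() {
            return (ReadableDoc) root.get("metadata");
        }

        EditableDoc editMetadata() {
            return (EditableDoc) root.get("metadata");
        }

        Object content() {
            return root.get("content");
        }

        // Convenience accessors: the MIME type lives in the metadata document.
        String mimeType() {
            return (String) metadata().get("contentType");
        }

        void setMimeType(String mimeType) {
            editMetadata().set("contentType", mimeType);
        }
    }

    public static void main(String[] args) {
        Entry entry = new Entry("hello", "text/plain");
        entry.setMimeType("application/json");
        System.out.println(entry.mimeType());
    }
}
```

Handing out the read-only view by default and the editable view only on request is what would give the stored value a single place to observe edits.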

            The current status is that this works for LOCAL mode, but additional work is required before DISTRIBUTED and REPLICATED modes will work correctly with delta-aware and fine-grained locking.

            As always, feedback is appreciated.

            Comment added by Randall Hauch (Inactive)

            Here's a still-incomplete prototype of a new 'schematic' module, and I'm looking for feedback as to whether this is going in the right direction:

            https://github.com/rhauch/infinispan/tree/ISPN-1103/schematic

            The new 'schematic' module contains a small document-oriented API layer on top of Infinispan and supports storing JSON and BSON documents and using JSON Schema documents to validate the structure of the documents. Rather than store the documents as strings within the Infinispan values, the value consists of a single Document with functionality that's a superset of BSON and JSON, with the goal being very efficient access and modification of the documents. Values are serialized using the BSON binary format representation. Documents are also immutable, with a simple "editable document" mechanism to allow the Infinispan values to "listen" for changes and to support DeltaAware functionality.

            I had originally started to use other JSON and BSON implementations, but was not happy with the APIs. JSON libraries do not natively support binary values and other types, while BSON libraries were less mature. No JSON or BSON libraries had support for JSON Schema, though there is a relatively simplistic and mostly-complete add-on for the Jackson JSON library. Finally, implementing our own also means that we could provide an immutable API for general document access while still making it easy to create and edit documents such that DeltaAware and transaction functionality will work cleanly.

            Perhaps it's useful/desirable to extract the JSON/BSON/JSON Schema functionality into a separate module or even a separate project (as it is likely useful in projects other than Infinispan). If so, please say so. But at this time I'm not worrying about that.

            Comment added by Randall Hauch (Inactive)

            Certainly worth considering.

            Comment added by Manik Surtani (Inactive)

            I completely agree that storing metadata as a JSON structure is a brilliant approach. I've added several features to the list in the description.

            Storing the jsonObject as a string might work, especially if it is simply loaded and stored as an atomic unit. However, there are several advantages to storing it in a BSON (or BSON-like) representation, and all boil down to the fact that any schema-aware service will need to access the document contents for indexing, validation, computing differences (for DeltaAware functionality), and even applying changes (e.g., something similar to MongoDB's atomic operations). Using a BSON representation will also make it more natural to handle binary values within the document.

            Now, if we did store the document internally as BSON, we actually don't need to store the metadata separately. For example, we can store a single document with this structure:

              {
                "document" : /* user's document */,
                "metadata" : {
                  "schema-ref" : blah-blah,
                  ...
                }
              }
            

            If represented as a BSONObject, then the user's document is actually a nested BSON object stored under the "document" property name. This approach means that all the differencing, atomic operations, serialization, etc. functionality will work without having to distinguish between a user's document and a system metadata document. The resulting value class would be:

            class SchematicValue {
                org.bson.BSONObject json;
            }
            

            The downside of doing this is the increase in size of the SchematicValue class, from two references to one reference plus an extra BSONObject implementation. The org.bson.BasicBSONObject extends LinkedHashMap<String,Object> (which is fairly substantial), but we could probably provide an optimized implementation for the top level that still implements the org.bson.BSONObject interface, but more directly. Is it worth the apparent simplicity?

            Comment added by Randall Hauch (Inactive), edited

              rhauch Randall Hauch (Inactive)
              manik_jira Manik Surtani (Inactive)
              Archiver:
              rhn-support-adongare Amol Dongare
