ModeShape 3.x maintains a single Lucene index for each repository, but there are several issues with this:
- In a clustered deployment, keeping the Lucene indexes of the different processes synchronized is difficult, complex, and error-prone, and the indexes can be inconsistent with one another (since the master content is transferred to the slaves only periodically).
- The Lucene index is used for querying all fields, even when the values and criteria don't involve search (e.g., numeric fields, exact text matches, pattern matches, etc.). Lucene is not ideal for these kinds of queries, whereas traditional indexes (e.g., based upon B*-tree or similar) would be far more efficient and effective.
- Using a single Lucene index for a whole repository is far from ideal, and it leads to concurrency problems (even in a local, non-clustered case).
- When nodes are changed, the whole document in Lucene must be updated. This means we can't really update the Lucene index with just what's changed, and thus updating the index requires accessing the node rather than just working from the events.
- Using Lucene (and Hibernate Search) adds a number of dependencies and complicates the build process, especially for the EAP kit.
- We're currently indexing all properties. This lets users reference any property in their criteria, but it also means that the indexes are large and that updating/replicating them takes longer. Ideally, we would offer the ability to index only the specific properties that are actually used in query criteria. Doing this with Lucene would be difficult.
- When a process in a cluster leaves the cluster (e.g., is taken down) and then (re)joins the cluster, ModeShape has no option other than to completely reindex the content (or, if master-slave is used, to copy the indexes, though this copying leads to other inconsistencies).
The objective of this feature is to replace the query engine with one that can use indexes explicitly defined by administrators. The query engine should still work when no indexes are defined, though it will be slower (potentially a lot slower) than if proper indexes are defined for a query. As with a regular relational database, which indexes you define will depend heavily on the queries you are using.
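To illustrate the intended behavior, here is a minimal sketch of a lookup that uses an index when one has been defined for the property and otherwise falls back to scanning every node. All names (`IndexedQuerySketch`, `defineIndex`, `findByProperty`) are hypothetical and are not part of ModeShape; the sketch also assumes that each path is added only once and that property values are plain strings.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not ModeShape's query engine.
class IndexedQuerySketch {
    // All content: node path -> (property name -> value), in insertion order.
    private final Map<String, Map<String, String>> nodes = new LinkedHashMap<>();
    // One sorted index per explicitly indexed property: value -> node paths.
    private final Map<String, TreeMap<String, List<String>>> indexes = new HashMap<>();

    // An administrator defines an index; existing content is indexed once.
    void defineIndex(String property) {
        TreeMap<String, List<String>> index = new TreeMap<>();
        nodes.forEach((path, props) -> addToIndex(index, property, path, props));
        indexes.put(property, index);
    }

    boolean hasIndex(String property) {
        return indexes.containsKey(property);
    }

    // Assumption: each path is added only once (no updates or removals).
    void add(String path, Map<String, String> properties) {
        Map<String, String> copy = new HashMap<>(properties);
        nodes.put(path, copy);
        // Only the explicitly indexed properties are written to indexes.
        indexes.forEach((property, index) -> addToIndex(index, property, path, copy));
    }

    private static void addToIndex(TreeMap<String, List<String>> index, String property,
                                   String path, Map<String, String> properties) {
        String value = properties.get(property);
        if (value != null) {
            index.computeIfAbsent(value, k -> new ArrayList<>()).add(path);
        }
    }

    List<String> findByProperty(String property, String value) {
        TreeMap<String, List<String>> index = indexes.get(property);
        if (index != null) {
            // Fast path: a single lookup in the sorted index.
            return new ArrayList<>(index.getOrDefault(value, List.of()));
        }
        // Slow path: no index is defined, so every node is examined.
        List<String> result = new ArrayList<>();
        nodes.forEach((path, props) -> {
            if (value.equals(props.get(property))) {
                result.add(path);
            }
        });
        return result;
    }
}
```

Both paths return the same results; defining an index changes only how much content has to be examined to answer the query.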
Additionally, indexes should be able to be stored/accessed using several "index provider" mechanisms, including:
- "internal" indexes (e.g., local files via MapDB; see
- local file-system-based indexes using Lucene
- indexes in Solr
- indexes in ElasticSearch
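One way to picture these pluggable mechanisms is a small provider interface with one implementation per backend. The sketch below is an assumption made for illustration: the names `SketchIndexProvider`, `InMemoryIndexProvider`, and `ProviderRegistry` are hypothetical and do not describe ModeShape's API, and the in-memory implementation merely stands in for a real backend.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical provider contract; illustrative names, not ModeShape's API.
interface SketchIndexProvider {
    String name();

    // Record that the node identified by nodeKey has the given value.
    void update(String indexName, String nodeKey, String value);

    // Return the keys of all nodes recorded with exactly this value.
    List<String> lookup(String indexName, String value);
}

// Stand-in backend that keeps everything in memory (add-only, no removal).
class InMemoryIndexProvider implements SketchIndexProvider {
    // index name -> (value -> node keys)
    private final Map<String, Map<String, List<String>>> indexes = new HashMap<>();

    @Override
    public String name() {
        return "in-memory";
    }

    @Override
    public void update(String indexName, String nodeKey, String value) {
        indexes.computeIfAbsent(indexName, k -> new HashMap<>())
               .computeIfAbsent(value, k -> new ArrayList<>())
               .add(nodeKey);
    }

    @Override
    public List<String> lookup(String indexName, String value) {
        return new ArrayList<>(
            indexes.getOrDefault(indexName, Map.of()).getOrDefault(value, List.of()));
    }
}

// The query engine would resolve providers by name rather than by type.
class ProviderRegistry {
    private final Map<String, SketchIndexProvider> providers = new HashMap<>();

    void register(SketchIndexProvider provider) {
        providers.put(provider.name(), provider);
    }

    SketchIndexProvider get(String name) {
        SketchIndexProvider provider = providers.get(name);
        if (provider == null) {
            throw new IllegalArgumentException("No index provider named " + name);
        }
        return provider;
    }
}
```

Under this design the query engine depends only on the interface, so a file-based, Lucene, Solr, or ElasticSearch backend could be swapped in without changing how queries are planned.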
Using explicitly-defined indexes should perform a lot better, since we'd be indexing only the information that needs to be indexed rather than all of the content, as we do with 3.x. It will also make clustering easier, since it (along with the journal service) makes it far easier to bring a process up and update its indexes after the process has been out of the cluster for a period of time.
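The catch-up step could look roughly like the following sketch, in which a rejoining process replays only the journal entries it has not yet applied instead of reindexing everything. The journal format shown here (`Change` entries with increasing sequence numbers, ordered oldest first) is an assumption made for illustration and does not describe the actual journal service; all class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of catching up an index from a change journal.
class JournalCatchUpSketch {

    // Assumed journal entry: one property change with a sequence number.
    static final class Change {
        final long sequence;
        final String nodeKey;
        final String property;
        final String newValue; // null means the property was removed

        Change(long sequence, String nodeKey, String property, String newValue) {
            this.sequence = sequence;
            this.nodeKey = nodeKey;
            this.property = property;
            this.newValue = newValue;
        }
    }

    // A local index on a single property that remembers how far it has read.
    static final class LocalIndex {
        final String property;
        final Map<String, String> valueByNode = new HashMap<>();
        long lastApplied = 0;

        LocalIndex(String property) {
            this.property = property;
        }

        void apply(Change change) {
            if (change.sequence <= lastApplied) {
                return; // already seen; replaying is harmless
            }
            // The index is updated from the event alone, without loading the node.
            if (change.property.equals(property)) {
                if (change.newValue == null) {
                    valueByNode.remove(change.nodeKey);
                } else {
                    valueByNode.put(change.nodeKey, change.newValue);
                }
            }
            lastApplied = change.sequence;
        }
    }

    // Replays the entries newer than the index's position; returns how many
    // entries were processed. Assumes the journal is ordered by sequence.
    static int catchUp(LocalIndex index, List<Change> journal) {
        int replayed = 0;
        for (Change change : journal) {
            if (change.sequence > index.lastApplied) {
                index.apply(change);
                replayed++;
            }
        }
        return replayed;
    }
}
```

Only the entries recorded while the process was away are processed, so the cost of rejoining is proportional to the missed changes rather than to the size of the repository.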