While looking at performance with the Lucene index provider, I noticed that queries against VALUE indexes on simple string properties were slow. I threw together a small test application that inserts 1000 nodes of a type with an index on a single string field, populating the field with a unique sequence of values so that the cardinality is very high. The test then runs 1000 searches using that field as the constraint with a random value. Test Java code, ModeShape config, and CND attached. The results were as follows (ModeShape 5.3.0, Windows 7 x64, Java 1.8.0_91):
Inserted 1000 in 580 ms.
Searched 1000 nodes 1000 times in 57314 ms.
Deleted 1000 in 25 ms.
Under a profiler, essentially all of the time is spent in the IndexReader.document call within ConstantScoreWeightQuery.java. Looking at the code, this query appears to do a linear scan of the index, forcing Lucene to instantiate a full document for each entry. Following that logic and digging further, I changed the EQUAL_TO case in LuceneQueryFactory.stringFieldQuery from:
return CompareStringQuery.createQueryForNodesWithFieldEqualTo(stringValue, field, factories, caseOperation);
to just using the built-in Lucene TermQuery:
return new TermQuery(new Term(field, stringValue));
The results running with this change are:
Inserted 1000 in 627 ms.
Searched 1000 nodes 1000 times in 1327 ms.
Deleted 1000 in 24 ms.
So roughly a 43x improvement in search time (57314 ms down to 1327 ms), which seems pretty good, and at least from my other testing it seems to provide correct results.
CompareStringQuery looks like it might still be necessary for operations that cannot be expressed as a plain term lookup, such as regular expression matching, but the simple string equality case seems like it should be delegated to Lucene.
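To illustrate why the two approaches differ so much, here is a minimal, self-contained sketch using only the Java standard library. It is an analogy, not ModeShape or Lucene code: the class EqualityLookupSketch, the Doc type, and both method names are hypothetical. scanEqualTo mimics loading every stored document and comparing its field, while termEqualTo mimics a term lookup in an inverted index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in ModeShape or Lucene.
public class EqualityLookupSketch {

    /** Stand-in for a stored index document: an id plus one string field. */
    static final class Doc {
        final int id;
        final String value;

        Doc(int id, String value) {
            this.id = id;
            this.value = value;
        }
    }

    // Stored documents, in insertion order (id == position in the list).
    private final List<Doc> stored = new ArrayList<>();
    // Inverted index: field value -> ids of the documents containing it.
    private final Map<String, List<Integer>> postings = new HashMap<>();

    void add(String value) {
        int id = stored.size();
        stored.add(new Doc(id, value));
        postings.computeIfAbsent(value, k -> new ArrayList<>()).add(id);
    }

    /** Linear scan: visit every document and compare its field, O(n) per query. */
    List<Integer> scanEqualTo(String value) {
        List<Integer> hits = new ArrayList<>();
        for (Doc doc : stored) {
            if (doc.value.equals(value)) {
                hits.add(doc.id);
            }
        }
        return hits;
    }

    /** Term lookup: a single hash probe into the postings map per query. */
    List<Integer> termEqualTo(String value) {
        return postings.getOrDefault(value, Collections.emptyList());
    }

    public static void main(String[] args) {
        EqualityLookupSketch index = new EqualityLookupSketch();
        for (int i = 0; i < 1000; i++) {
            index.add("value-" + i);
        }
        System.out.println(index.scanEqualTo("value-42")); // prints [42]
        System.out.println(index.termEqualTo("value-42")); // prints [42]
    }
}
```

Both methods return the same hits, but the scan touches all 1000 entries on every query, while the lookup touches only the matching postings. Note that the sketch compares values exactly; whether the caseOperation argument of the original call needs separate handling is not covered here.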