Our current [performance tests](https://github.com/ModeShape/modeshape-performance) do a lot of small, fine-grained operations. These have some value, but IMO they are too fine-grained: it's really difficult to extrapolate from them to the kind of performance you'd see in a real application, and that makes the results hard to act upon.
(They have been valuable while developing ModeShape 3.x, since we've been able to use them to see how the performance has changed over the various releases. But this is becoming far less useful, primarily because the earlier 3.x releases were not nearly as stable as the later 3.7 release.)
We should have a benchmark framework where we could easily test a single high-level application scenario in a few different configurations (primarily cache stores). Then we could come up with a few different scenarios.
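To make the idea concrete, here is a minimal sketch of what such a scenario abstraction might look like. All names (`Scenario`, `NoOpScenario`, the operation names) are hypothetical, not anything that exists in the codebase: the point is just that a scenario bundles setup, a set of named measurable operations, and teardown, so the same scenario can be run unchanged against different repository configurations.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;

// Hypothetical framework interface: one high-level application scenario.
interface Scenario {
    /** Prepare repository content; 'profile' names the configuration to run against. */
    void initialize(String profile) throws Exception;

    /** The named, measurable operations this scenario exposes. */
    Map<String, Callable<Void>> operations();

    /** Clean up so the next profile starts fresh. */
    void shutdown() throws Exception;
}

// A trivial placeholder scenario, standing in for something like "bulk-load customers".
class NoOpScenario implements Scenario {
    private String profile;

    public void initialize(String profile) { this.profile = profile; }

    public Map<String, Callable<Void>> operations() {
        Map<String, Callable<Void>> ops = new LinkedHashMap<>();
        ops.put("load-customers", () -> null);       // a real scenario would touch the repository here
        ops.put("find-random-customer", () -> null);
        return ops;
    }

    public void shutdown() {}
}
```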
One scenario might be to store a very large number of customers with UUID-based identifiers (and node names). We've run across this a half-dozen times in the community (we actually do this inside version storage). This would require creating some intermediate-level "hierarchy" nodes to organize the customers by their identifiers, and each "hierarchy" node would ultimately contain a large number of non-SNS children. Although it would be optimal to create the customers ordered by sequential UUIDs, a more interesting scenario would involve bulk-loading the customers randomly.
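One possible bucketing scheme for those "hierarchy" nodes (purely illustrative; the level count, the `/customers` root, and the class name are assumptions) is to name two levels of intermediate nodes after the UUID's leading hex pairs, so no single node accumulates millions of children:

```java
import java.util.UUID;

class CustomerPaths {
    /**
     * Derive a repository path for a customer node from its UUID, inserting two
     * levels of "hierarchy" nodes named after the UUID's leading hex pairs.
     * With 2 hex chars per level, each level fans out into at most 256 children.
     */
    static String pathFor(UUID id) {
        String hex = id.toString().replace("-", "");
        return "/customers/" + hex.substring(0, 2) + "/" + hex.substring(2, 4) + "/" + id;
    }
}
```

Because the bucket names are derived from the identifier itself, this scheme works equally well whether the customers arrive in sequential or random UUID order.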
Another scenario might be to store a very large number of nodes in a more traditional, hierarchical structure. Here, the number of children on any given node would be smaller. (Coming up with this structure might be more difficult, especially in a way that is repeatable. After all, we don't want to just test import.)
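One way to get a repeatable structure without importing a fixed file (a sketch with assumed names and parameters) is to generate the hierarchy from a seeded random number generator, so the same seed always produces the same tree:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

class TreeGenerator {
    /**
     * Generate a repeatable list of node paths forming a hierarchy: the same
     * seed always yields the same structure, so a scenario can recreate
     * identical content on every run without doing an import.
     */
    static List<String> generatePaths(long seed, int depth, int maxChildren) {
        List<String> paths = new ArrayList<>();
        addChildren(new Random(seed), "", depth, maxChildren, paths);
        return paths;
    }

    private static void addChildren(Random rng, String parent, int depth,
                                    int maxChildren, List<String> paths) {
        if (depth == 0) return;
        int children = 1 + rng.nextInt(maxChildren);  // at least one child per node
        for (int i = 0; i < children; i++) {
            String path = parent + "/node" + i;
            paths.add(path);
            addChildren(rng, path, depth - 1, maxChildren, paths);
        }
    }
}
```

Keeping `maxChildren` small matches the "fewer children per node" shape of this scenario, while `depth` controls the overall size.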
Each scenario would define a series of operations to be measured and accumulated over time. These operations are likely to be scenario-specific (time to bulk load 5M customers, time to find all the info about a random customer, etc.), though some may be common across scenarios (time to create/clone/delete a subtree of some size, time to issue a series of queries, etc.). The framework just needs to make it easy to measure, aggregate, and report them.
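The measure/aggregate/report piece doesn't need to be elaborate. A minimal sketch (hypothetical `Metrics` class, not an existing API) that accumulates per-operation timings and prints a summary:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class Metrics {
    /** Accumulated samples for one named operation. */
    static class Stat {
        long count, totalNanos, max;
        long min = Long.MAX_VALUE;

        void record(long nanos) {
            count++;
            totalNanos += nanos;
            min = Math.min(min, nanos);
            max = Math.max(max, nanos);
        }

        double avgMillis() { return (totalNanos / (double) count) / 1_000_000.0; }
    }

    private final Map<String, Stat> stats = new LinkedHashMap<>();

    /** Time a single execution and fold it into the aggregate for 'name'. */
    void time(String name, Runnable op) {
        long start = System.nanoTime();
        op.run();
        long elapsed = System.nanoTime() - start;
        stats.computeIfAbsent(name, k -> new Stat()).record(elapsed);
    }

    Stat stat(String name) { return stats.get(name); }

    /** Simple report: one line per operation. */
    void report() {
        stats.forEach((name, s) ->
            System.out.printf("%-30s n=%d avg=%.3f ms min=%.3f ms max=%.3f ms%n",
                              name, s.count, s.avgMillis(),
                              s.min / 1_000_000.0, s.max / 1_000_000.0));
    }
}
```

Scenario-specific operations and common ones would flow through the same `time(...)` call, so reporting stays uniform across scenarios.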
We then need to be able to run each scenario in multiple profiles, each representing a different "configuration" (though we'd likely use profiles mostly for comparing different cache stores).
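In practice a profile could be as simple as a name mapped to a repository configuration file, with each scenario run once per entry so the numbers are directly comparable across cache stores. A sketch (the profile names and config paths below are made up for illustration):

```java
import java.util.LinkedHashMap;
import java.util.Map;

class Profiles {
    /**
     * Map each profile name to the repository configuration it uses.
     * An insertion-ordered map keeps the report in a predictable order.
     */
    static Map<String, String> defaults() {
        Map<String, String> profiles = new LinkedHashMap<>();
        profiles.put("in-memory", "config/repo-in-memory.json");
        profiles.put("file-store", "config/repo-file-store.json");
        profiles.put("jdbc-store", "config/repo-jdbc-store.json");
        return profiles;
    }
}
```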