We can easily scale VRP to 10k locations (with nearby selection etc), but to scale VRP to 100k locations, there are a number of problems: limitedSelection cause result degradation, nearbySelection consumes too much RAM (we're working on that in
PLANNER-485), etc. Especially some CH implementation become slow. Another approach is Partitioned Search, which is used a lot in the real world and I 've got a POC on kaggle's santas-stolen-sleigh that shows it works well actually.
But let me be clear: using Partitioned Search is "a sign of weakness", a sign that we're scaling beyond our state of the art solver's capabilities. It limits potential result quality. See this blog for an nice example:
Our competitors need to partition at 400+ locations. We need to partition at 10k+ locations. We should always aim for increasing our partition size limit even more.
- Concurrently solve each partition with multiple threads, but this is not the real multi-threaded solving of
There are 3 ways to do multi threaded solving.
- Bet on multiple horses that run separately. Ignore all but the best horse.
- Bet on 1 carriage with multiple horses (
- Bet on 1 superhorse (
PLANNER-76): real multi-threaded solving
Proposal A strawman:
- Must handle clean partitioning cases such as CVRP (each partition gets a distinct subset of vehicles and customers) and merging pain cases such as TSP (merging 2 TSP partitions needs domain specific code).
- Termination now has an extra layer: global solver, partition solver, phase. For example:
- We have 10 partitions that each take 3 minutes. So the solver will stop after 30 minutes
- Or, the global solver has 20 minutes to solve. What if we have 10 partitions that all terminate after 3 minutes? The last partitions don't get run...
- Termination should support automatically spreading time across partitions, taking local thread count into account.
- Thread count can be automatic based on number of CPU's, like in benchmarker
- ExecutorService needs to be injectable. And shareable with other subsystems.
- indent logging of subsolvers
- Number of partitions needs to be dynamic and depend on the dataset size, for example it's simply the List<Solution>.size() after splitting.
- A fail fast when merging is due to the nature of the domain impossible would be nice. OTH, it's never impossible, it's just that merging n feasible solutions might become 1 unfeasible solution.
- The partitionedSearchPhase must still send out best solution changed events: Every time a partition solver finds a new best solution, the partitionedSearchPhase's working solution is adjusted accordingly, which might trigger a best solution changed event (and usually does if the all partitions are initialized).
- To make this happen easily, a custom Partitioner implementation needs to register partition clones... (internally we can use the surrogate key subsystem which we need to build)
- The number of threads must be supplied to the Partitioner, so it can use that information to decide the number of partitions.
- Also support "round robin random subset as partition". Instead of splitting up 100k process into 100 partitions of 1k processes and solving them 4 at a time (4 is number of threads), just grab a random 1k unique processes 4 times and solve them. If 1 partition no longer improves for s steps, return it to the pool and grab another 1k random unique processes (not already being solved by any of the other 3 threads of course).
- Support out of the box partition schemes through reflection. For example for chained variables when solution initialized: each partition get x unique chains. For example for cloud balancing: each partition get x unique processes and y unique computers.
- partitionKey: The out-of-the-box Partitioner implementations need to be able to do "group by", based on entity.getMyPKey() if partitionKey="myPKey". For example, in machine reassignment, the partitionKey might be process.getService(). For course scheduling, the teachers, nor the curricula should be divided across partitions, so we might need to have multiple partitionKey's and we need to find a better name.
- immovable entities are treated specially by the out-of-the-box Partitioners:
- if their entity and planning values all belong to 1 partition, nothing special happens.
- if their entity or planning values belong to multiple partitions, they get partitioned cloned in all those partitions.
- immovable entities do not influence the partition sizes: they don't up to the partition count limit etc
- When multiple child phases are configured, the partitioned search should first run the first phase (probably CH) on all parts before running the second phase on all parts. So 10 parts with 2 phases each actually become 20 jobs.
- The parts aren't actually run by a separate Solver instance: PS reuses the phase code (CH and LS) - it doesn't need to recreate an entire Solver.
- Time (termination) vs cpu count etc
- When partListSize <= availableProcessors (or the amount given), all parts just run in parallel
- When partListSize <= threadPoolSize, all parts just run. Even if threadPoolSize > availableProcessors - let the JVM round robin thread stuff do it's job
- When partListSize > threadPoolSize, things fall apart. For example 10 parts on 4 threads (on 4 cores): for part 5 to run, part 1 needs to be set on hold.
- If the phaseList contains both CH and LS, we should split those up: by running all CH's first, so we deliver an initial solution faster (think terminateEarly(), no-termination solving, etc).
- Continuations in Java are a potential pain for distribution, installation etc, so that's a no go
- So a Phase (especially LS and ES) need to be able solveABitOfTime(), and we just keep calling those in a round robin fashion until the time is up.
- So how much is a bit of time? That's Termination specific...
- Solution proposal A) Just take the termination time and divide by partListSize, but not multiplied by threadPoolSize or cpuSize. Just rerun them as long as there's still time.
- Solution proposal B). Run partition 1 through 10 for 1 step. Then for 2 more steps. Then for 4 more steps. Then for 8 more steps.
- Always stop immediately if global time is up of course.
- Even in this approach, running with no-termination in combination with terminateEarly() should be possible.
- The docs should clearly mention that DEBUG logging is a point of congestion in multitenant and partitioned search.
- The benchmarker should report something around that.