Feature Request
Resolution: Done
Priority: Major
Sprint: 2018 Week 09-10, 2018 Week 11-12, 2018 Week 13-14
Reqs:
- The (LocalSearch)Decider.decideNextStep() will offload work to move threads.
- Each move will only be done (and undone) on one of the move threads. The solver thread won't do that any more.
- The winning move (= step) will be done on the solver thread and every move thread (to keep the working solution state in sync)
- The move thread will do the move, calculate the score (= biggest chunk of work), decide acceptance and undo the move (see the move thread worker sketch after this list)
- The solver thread will still do the forager work
- REFACTOR IMPACT: the forager will receive the moveScope after it is undone (in 7.6 it receives it before it is undone)
- OPEN QUESTION: What about the selector that generates the move? In the solver thread or in the move thread?
- Reproducibility
- We cannot sacrifice reproducibility for multithreaded solving. That's out of the question.
- But we are allowed to take shortcuts if the environmentMode is non-reproducible. But that is an edge case!
- With 4 move threads, we calculate the score for 4 moves in parallel, that's the whole point of going multithreaded.
- So the selector calls nextRandom() to generate 4 moves before the first move's result is returned.
- If the 2nd move is picked, the nextRandom() was already called for move 3 and move 4. There is no escaping that!
- If there are only 2 move threads, those 2 nextRandom() calls wouldn't have been called and the Random would be in a different state. That kills reproducibility.
- Therefore reproducibility is limited to when using the same number of move threads.
- Move threads are likely to default to `cpuSize - 2`.
- So reproducibility is not platform independent by default, but the info log will report the move thread count.
- But multithreaded solving won't be enabled by default anyway.
- It would be nice to be able to specify the number of move threads, well above the number of CPUs, to reproduce an issue from a 64-core machine on an 8-core machine?
- Guiding principle: if a move might have been generated, it must be generated even if the winning step is already known.
- Random
- Some acceptors use Random. For example: Simulated Annealing. Acceptor.isAccepted() is done in the move thread.
- Using the same Random instance concurrently across multiple threads kills reproducibility because there is no way to control the order.
- So each move thread will get its own random, seeded by the solver thread's random.
- QA question: does this hurt the "uniform distribution"? With a simple Monte Carlo simulation we can answer that question (see the Monte Carlo sketch after this list). Or we can ask on math/CS's Stack Overflow.
- The alternative is to do acceptance in the solver thread, which has 2 disadvantages: less parallelization and the move is already undone before isAccepted() is called. The latter is a no-go.
- The solver thread and move threads communicate by sending move wrappers (potentially MoveScope instances) back and forth.
- With Move.rebase() we can easily rebase those to the new thread's workingSolution.
- This leads us into producers/consumers, queues, etc. How do we design this?
Proposal A: 1 moveQueue and 1 scoreQueue on the solver thread
- The solver thread generates the moves.
- It controls how many unforaged moves are created in parallel, usually `moveThreadCount * 3`. => Reproducible.
- The generated moves go into the global moveQueue. The move threads take from that.
- WEAKNESS: move threads hit that one moveQueue concurrently
- It fills up the queue to the "unforagedMoveMaximum".
- For example, with 4 move threads (on a 6 core machine, but that doesn't matter) and the default buffer size of 3, there would be `3*4 = 12` unforaged moves in play.
- In a loop, the move threads take a move from the moveQueue, process it and put the result on the scoreQueue.
- If the moveQueue blocks, they wait. So they are effectively bounded by that unforagedMoveMaximum. (this is a good thing)
- The scoreQueue is unbounded, because it is bounded by that unforagedMoveMaximum implicitly due to the behaviour below
- In a loop, the solver thread continuously takes 1 processed move from the scoreQueue, forages it, and generates a new unprocessed move in the moveQueue.
- If the winning step is chosen, it stops after foraging.
- The number of generated moves is stable. => Reproducible
- STRENGTH: if 1 move thread is very slow, the other move threads can process twice as many moves, without blocking.
- In some cloud containers, we get 2.5 CPUs. This means we get 2 full CPUs and 50% of the time on another CPU.
- The scoreQueue needs to sort itself based on the moveIndex, so that it outputs moves in the same order as they were generated (see the sketch after this proposal).
- WEAKNESS: if a slow move still isn't calculated by the time that the other move threads have calculated the 11 other moves after it, those move threads will block on moveQueue.take()
- Workaround: increase buffer size. When playing with ruinAndRecreate, it might be wise to have a large buffer size of 200.
- There is no real harm in this, besides potentially doing 199 discarded moves when Threads could have been blocking instead.
Proposal B: every move thread has a moveQueue and a scoreQueue
- The solver thread generates the moves.
- It controls how many unforaged moves are created in parallel, usually `moveThreadCount * 3`. => Reproducible.
- The generated moves go round-robin into the moveThreads' moveQueues (see the sketch after this proposal).
- So with 4 moveThreads, each moveQueue gets 3 moves, so again 12 unforaged moves in play.
- STRENGTH: Very little concurrency on the moveQueue.
- But still the same concurrency on the scoreQueue (which is unavoidable really - at least without heavy forager refactors, which is out of scope for now)
- In a loop, the solver thread continuously goes to the next move thread, takes 1 processed move from its scoreQueue, forages it, and generates a new unprocessed move in its moveQueue.
- WEAKNESS: if 1 move thread is very slow, the other move threads will block on the scoreQueue.
- The slowest move thread effectively sets the pace: all other move threads become equally slow.
- THREAT: a ruinAndRecreate move can be a 100 times slower than a simple ChangeMove.
Proposal C: every move thread has its own selector and a scoreQueue. There is no moveQueue.
- Each move thread generates its own moves.
- Its scoreQueue blocks if it would go more than 3 moves in advance, to control the number of moves created in parallel? (See the sketch after this proposal.)
- Is this enough to guarantee reproducibility?
- STRENGTH: the selector move generation is also parallel
- WEAKNESS: in IndictmentLocalSearch (prototype not yet merged), the moveThreads must also enable ConstraintMatch tracking.
- WEAKNESS/THREAT: slowest move thread sets the pace (see proposal B)
Proposal D: proposal B with work stealing from the moveQueues, but one global scoreQueue with sorting like in proposal A
Proposal E: proposal B with work stealing, but give the result back to the move thread it was stolen from, with local sorting there
- Work stealing
- Every move thread first poll()'s its own queue. If that is empty, it steals from other queues.
- If all queues are empty, it take()'s from its own queue, blocking if it is still empty. This is possible if foraging is slow.
- The queue to steal work from is i + 1, then 3 further, then 7 further, then 15 further, etc. (1 + 2 = 3, 3 + 4 = 7, 7 + 8 = 15), modulo the queue count of course (see the sketch below).
- This means that if the queues of move threads 10, 11 and 12 are empty:
- 10 will look to 11 (empty), then 14 (non-empty)
- 11 will look to 12 (empty), then 15 (non-empty)
- 12 will look to 13 (non-empty)
- Giving it back
- After the stolen move is calculated, it is given back to the original move thread's scoreQueue.
- That scoreQueue's output is sorted on moveIndex.
is incorporated by:
- PLANNER-76 Multi-threaded incremental Solver (without partitioning): construction heuristics, local search, ... (Resolved)