Loading...

Type: Feature Request
Resolution: Done
Priority: Major
Fix Version/s: 7.8.0.Final
Affects Version/s: None
Component/s: optaplanner-core
Labels:
None

Epic Link:
Multi-threaded solving
Sprint:
2018 Week 09-10, 2018 Week 11-12, 2018 Week 13-14
Docs QE Status:
NEW
QE Status:
NEW

Reqs:

The (LocalSearch)Decider.decideNextStep() will offload work to move threads.
- Each move will only be done (and undone) on one of the move threads. The solver thread won't do that any more.
- The winning move (= step) will be done on the solver thread and every move thread (to keep the working solution state in sync)
- The move thread will do the move, calculate the score (= biggest chunk of work), decide acceptance and undo the move
- The solver thread will still do the forager work
  - REFACTOR IMPACT: the forager will receive the moveScope after it is undone (in 7.6 it receives it before it is undone)
- OPEN QUESTION: What about the selector that generates the move? In the solver thread or in the move thread?
Reproducibility
- We cannot sacrifice reproducibility for multithreaded solving. That's out of the question.
  - But we are allowed to take shortcuts if environmentMode is non-reproducbility. But this is is an edge case!
- With 4 move threads, we calculate the score for 4 moves in parallel, that's the whole point of going multithreaded.
  - So the selector calls nextRandom() to generate 4 moves before the first move's result is returned.
  - If the 2nd move is picked, the nextRandom() was already called for move 3 and move 4. There is no escaping that!
    - If there are only 2 move threads, those 2 nextRandom() calls wouldn't have been called and the Random would be in a different state. That kills reproducibility.
  - Therefore reproducibility is limited to when using the same number of move threads.
    - Move threads are likely to default to `cpuSize - 2`.
      - So reproducibility is platform independent, but the info log will report the move thread count.
      - But multithreaded solving won't be enabled by default anyway.
    - It would be nice to specify the number of move threads, well above the number cpu's, to reproduce an issue of 64-core machine on an 8-core machine?
  - Guiding principle: if a move might have been generated, it must be generated even if the winning step is already known.
- Random
  - Some acceptors use Random. For example: Simulated Annealing. Acceptor.isAccepted() is done in the move thread.
  - Using the same Random instance concurrently across multiple threads kills reproducibility because there is no way to control the order.
  - So each move thread will get its own random, seeded by the solver thread's random.
    - QA question: does this hurt "uniform distribution"? With a simple monte-carlo simulation we can answer that question. Or we can ask on math/CS's stackoverflow.
    - The alternative is to do acceptance in the solver thread, which has 2 disadvantages: less parallelization and the move is already undone before isAcceptance. The latter is a no-go.
- The solver thread and move threads communicate by sending move wrappers (potentially MoveScope instances) back and forth.
  - With Move.rebase() we can easily rebase those to the new thread's workingSolution.
  - This leads us into producers/consumers, queues, etc. How do we design this?

Proposal A: 1 moveQueue and 1 scoreQueue on the solver thread

The solver thread generates the moves.
- It controls how many unforaged moves are in created parallel, usually `moveThreadCount * 3`. => Reproducible.
The generated moves go into the global moveQueue. The move threads take from that.
- WEAKNESS: move threads hit that one moveQueue concurrently
- It fills up the queue to the "unforagedMoveMaximum".
  - For example, with 4 move threads (on a 6 core machine, but that doesn't matter) and the default buffer size of 3, there would be `3*4 = 12` unforaged moves in play.
In a loop, the move threads take a move from the moveQueue, process it and put the result on the scoreQueue.
- If the moveQueue blocks, they wait. So they are effectively bounded by that unforagedMoveMaximum. (this is a good thing)
- The scoreQueue is unbounded, because it is bounded by that unforagedMoveMaximum implicitly due to the behaviour below
In a loop, the solver thread continuously takes 1 processed move from the scoreQueue, forages it, and generates a new unprocessed move in the moveQueue.
- If the winning step is chosen, it stops after foraging.
  - The number of generated moves is stable. => Reproducible
- STRENGTH: if 1 move thread is very slow, the other moves threads can process twice as many moves, without blocking.
  - In some cloud containers, we get 2.5 CPU's. This means we get 2 full cpu's and 50% time for another CPU.
The scoreQueue needs to sort itself based on the moveIndex, so that it outputs moves in the same order as they were generated.
- WEAKNESS: if a slow move still isn't calculated by the time that all the other move threads calculated all other 11 moves after that move, those move threads will block on moveQueue.take()
  - Workaround: increase buffer size. When playing with ruinAndRecreate, it might be wise to have a large buffer size of 200.
    - There is no real harm in this, besides potentially doing 199 discarded moves when Threads could have been blocking instead.

Proposal B: every move thread has a moveQueue and a scoreQueue

The solver thread generates the moves.
- It controls how many unforaged moves are created in parallel, usually `moveThreadCount * 3`. => Reproducible.
The generated moves go round-robin into the moveThreads' moveQueues.
- So with 4 moveThreads, each moveQueue gets 3 moves, so again 12 unforaged moves in play.
- STRENGTH: Very little concurrency on the moveQueue.
  - But still the same concurrency on the scoreQueue (which is unavoidable really - at least without heavy forager refactors, which is out of scope for now)
In a loop, the solver thread continuously goes to the next move thread, takes 1 processed move from its scoreQueue, forages it, and generates a new unprocessed move in its moveQueue.
- WEAKNESS: if 1 move thread is very slow, the other moves threads will block on the scoreQueue.
  - The slowest move thread effectively sets the pace: all other move threads become equally slow.
  - THREAT: a ruinAndRecreate move can be a 100 times slower than a simple ChangeMove.

Proposal C: every move thread has its own selector and a scoreQueue There is no moveQueue.

Each move thread generates its own moves.
- Its scoreQueue blocks if it would go more than 3 moves in advance, to control the number of moves created in a parallel?
  - Is this enough to guarantee reproducibility?
- STRENGTH: the selector move generation is also parallel
- WEAKNESS: in IndictmentLocalSearch (prototype not yet merged), the moveThreads must also enable ConstraintMatch tracking.
- WEAKNESS/THREAT: slowest move thread sets the pace (see proposal B)

Proposal D: proposal B with work stealing from the moveQueues, but one global scoreQueue with sorting like in proposal A

Proposal E: proposal B steal work, but give it back from where you stole it, with sorting there locally

Work stealing
- Every move thread first poll()'s its own queue. If that is empty, it steals from other queues.
  - If all queues are empty, it takes()'s from it's own queue, blocking if it is still empty. This is possible if foragering is slow.
  - The thread to work steal from, is i + 1, i + 3, i + 7, i + 15, etc. ( 1 + 2 = 3, 3 + 4 = 7, 7 + 8 = 15). Modulus queue size of course.
    - This means that if move thread 10, 11 and 12 are empty, that:
      - 10 will look to 11 (empty), then 14 (non-empty)
      - 11 will look to 12 (empty), then 15 (non-empty)
      - 12 will look to 13 (non-empty)
Giving it back
- After the stolen move is calculated, it is given back to the original move thread's scoreQueue.
- That scoreQueue's output is sorted on moveIndex.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

Selection_641.png
31 kB
2018/02/27 11:10 PM
Selection_640.png
33 kB
2018/02/27 11:03 PM
ABbenchmarkResults.ods
67 kB
2018/02/28 9:12 AM

is incorporated by

PLANNER-76 Multi-threaded incremental Solver (without partitioning): construction heuristics, local search, ...

Resolved

Details

Description

Attachments

Attachments

Issue Links

Easy Agile Planning Poker

Activity

People

Dates