OptaPlanner / PLANNER-1025

Multithreaded solving: queue design for inter-thread communication


    • Type: Feature Request
    • Resolution: Done
    • Priority: Major
    • Fix version: 7.8.0.Final
    • Component: optaplanner-core
    • Sprints: 2018 Week 09-10, 2018 Week 11-12, 2018 Week 13-14

      Requirements:

      • The (LocalSearch)Decider.decideNextStep() will offload work to move threads.
        • Each move will only be done (and undone) on one of the move threads. The solver thread won't do that any more.
        • The winning move (= step) will be done on the solver thread and every move thread (to keep the working solution state in sync)
        • The move thread will do the move, calculate the score (= biggest chunk of work), decide acceptance and undo the move
        • The solver thread will still do the forager work
          • REFACTOR IMPACT: the forager will receive the moveScope after it is undone (in 7.6 it receives it before it is undone)
        • OPEN QUESTION: What about the selector that generates the move? In the solver thread or in the move thread?
      • Reproducibility
        • We cannot sacrifice reproducibility for multithreaded solving. That's out of the question.
          • But we are allowed to take shortcuts if environmentMode is non-reproducible. But that is an edge case!
        • With 4 move threads, we calculate the score for 4 moves in parallel, that's the whole point of going multithreaded.
          • So the selector calls nextRandom() to generate 4 moves before the first move's result is returned.
          • If the 2nd move is picked, the nextRandom() was already called for move 3 and move 4. There is no escaping that!
            • If there are only 2 move threads, those 2 nextRandom() calls wouldn't have been called and the Random would be in a different state. That kills reproducibility.
          • Therefore reproducibility is limited to runs with the same number of move threads.
            • Move threads are likely to default to `cpuSize - 2`.
              • So reproducibility is platform dependent, but the info log will report the move thread count.
              • But multithreaded solving won't be enabled by default anyway.
            • It would be nice to be able to specify the number of move threads well above the number of CPUs, to reproduce an issue from a 64-core machine on an 8-core machine.
          • Guiding principle: if a move might have been generated, it must be generated even if the winning step is already known.
        • Random
          • Some acceptors use Random. For example: Simulated Annealing. Acceptor.isAccepted() is done in the move thread.
          • Using the same Random instance concurrently across multiple threads kills reproducibility because there is no way to control the order.
          • So each move thread will get its own Random, seeded by the solver thread's Random (see the sketch after this list).
            • QA question: does this hurt the uniform distribution? A simple Monte Carlo simulation can answer that question, or we can ask on the math/CS Stack Exchange sites.
            • The alternative is to do acceptance in the solver thread, which has 2 disadvantages: less parallelization and the move is already undone before isAccepted() is called. The latter is a no-go.
        • The solver thread and move threads communicate by sending move wrappers (potentially MoveScope instances) back and forth.
          • With Move.rebase() we can easily rebase those to the new thread's workingSolution.
          • This leads us into producers/consumers, queues, etc. How do we design this?
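
      A minimal sketch of that per-thread Random seeding, together with the Monte Carlo uniformity check mentioned above. It only assumes plain java.util.Random; the class and variable names are illustrative, not OptaPlanner API.

```java
import java.util.Arrays;
import java.util.Random;

// Sketch (not OptaPlanner API): one Random per move thread, each seeded from
// the solver thread's Random, plus a crude Monte Carlo uniformity check.
public class PerThreadRandomSketch {

    public static void main(String[] args) {
        Random solverRandom = new Random(0L); // the solver thread's master Random
        int moveThreadCount = 4;
        Random[] moveThreadRandoms = new Random[moveThreadCount];
        for (int i = 0; i < moveThreadCount; i++) {
            // Deterministic seeding: same solver seed + same moveThreadCount
            // => same per-thread random streams => reproducible.
            moveThreadRandoms[i] = new Random(solverRandom.nextLong());
        }

        // The "does this hurt the uniform distribution?" QA question, answered
        // empirically: bucket-count nextInt(10) draws per derived Random.
        int sampleCount = 1_000_000;
        for (int i = 0; i < moveThreadCount; i++) {
            int[] buckets = new int[10];
            for (int s = 0; s < sampleCount; s++) {
                buckets[moveThreadRandoms[i].nextInt(10)]++;
            }
            // Each bucket should hover around sampleCount / 10 = 100 000.
            System.out.println("moveThread " + i + " buckets: " + Arrays.toString(buckets));
        }
    }
}
```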

      Proposal A: 1 moveQueue and 1 scoreQueue on the solver thread

      • The solver thread generates the moves.
        • It controls how many unforaged moves are created in parallel, usually `moveThreadCount * 3`. => Reproducible.
      • The generated moves go into the global moveQueue. The move threads take from that.
        • WEAKNESS: move threads hit that one moveQueue concurrently
        • It fills up the queue to the "unforagedMoveMaximum".
          • For example, with 4 move threads (on a 6 core machine, but that doesn't matter) and the default buffer size of 3, there would be `3*4 = 12` unforaged moves in play.
      • In a loop, the move threads take a move from the moveQueue, process it and put the result on the scoreQueue.
        • If the moveQueue blocks, they wait. So they are effectively bounded by that unforagedMoveMaximum. (this is a good thing)
        • The scoreQueue is an unbounded queue, because that unforagedMoveMaximum already bounds it implicitly due to the behaviour below.
      • In a loop, the solver thread continuously takes 1 processed move from the scoreQueue, forages it, and generates a new unprocessed move in the moveQueue.
        • If the winning step is chosen, it stops after foraging.
          • The number of generated moves is stable. => Reproducible
        • STRENGTH: if 1 move thread is very slow, the other move threads can process twice as many moves, without blocking.
          • In some cloud containers, we get 2.5 CPUs: 2 full CPUs plus 50% of the time of another CPU.
      • The scoreQueue needs to sort itself based on the moveIndex, so that it outputs moves in the same order as they were generated.
        • WEAKNESS: if a slow move still isn't calculated by the time that the other move threads have calculated the 11 other moves in play, those move threads will block on moveQueue.take() (the solver thread can't forage past the slow move, so it stops generating new moves).
          • Workaround: increase buffer size. When playing with ruinAndRecreate, it might be wise to have a large buffer size of 200.
            • There is no real harm in this, besides potentially doing 199 discarded moves when threads could have been blocking instead.
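
      A hypothetical sketch of proposal A's plumbing with plain java.util.concurrent queues. MoveTask, MoveResult and the hardcoded counts are illustrative stand-ins, not OptaPlanner API; the solver-side out-of-order buffer plays the role of the scoreQueue sorting on moveIndex.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch (not OptaPlanner API): proposal A with one shared moveQueue,
// one shared scoreQueue and in-order foraging on the solver thread.
public class ProposalASketch {

    record MoveTask(int moveIndex) {}
    record MoveResult(int moveIndex, long score) {}

    public static void main(String[] args) throws InterruptedException {
        int moveThreadCount = 4;
        int unforagedMoveMaximum = moveThreadCount * 3; // the "buffer size 3" default
        BlockingQueue<MoveTask> moveQueue = new ArrayBlockingQueue<>(unforagedMoveMaximum);
        BlockingQueue<MoveResult> scoreQueue = new LinkedBlockingQueue<>(); // implicitly bounded

        for (int t = 0; t < moveThreadCount; t++) {
            Thread moveThread = new Thread(() -> {
                try {
                    while (true) {
                        MoveTask task = moveQueue.take(); // all move threads hit this one queue
                        // do move, calculate score (the biggest chunk), isAccepted(), undo move
                        scoreQueue.put(new MoveResult(task.moveIndex(), task.moveIndex() * 31L));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            moveThread.setDaemon(true);
            moveThread.start();
        }

        // Solver thread: pre-fill the moveQueue, then forage results in
        // generation order, topping the moveQueue up one move per foraged result.
        int generatedCount = 0;
        while (generatedCount < unforagedMoveMaximum) {
            moveQueue.put(new MoveTask(generatedCount++));
        }
        Map<Integer, MoveResult> outOfOrderBuffer = new HashMap<>();
        int stepMoveCount = 40; // pretend the step ends after 40 foraged moves
        for (int nextIndex = 0; nextIndex < stepMoveCount; nextIndex++) {
            MoveResult result = outOfOrderBuffer.remove(nextIndex);
            while (result == null) {
                MoveResult taken = scoreQueue.take();
                if (taken.moveIndex() == nextIndex) {
                    result = taken;
                } else {
                    outOfOrderBuffer.put(taken.moveIndex(), taken); // arrived early
                }
            }
            // The forager sees the result here, always in moveIndex order.
            System.out.println("Foraged move " + result.moveIndex() + " with score " + result.score());
            moveQueue.put(new MoveTask(generatedCount++)); // keep the pipeline full
        }
    }
}
```

      A PriorityBlockingQueue ordered on moveIndex would work too, but the solver would still have to peek-and-wait until the head equals the next expected index, so a plain buffer map is just as simple.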

      Proposal B: every move thread has a moveQueue and a scoreQueue

      • The solver thread generates the moves.
        • It controls how many unforaged moves are created in parallel, usually `moveThreadCount * 3`. => Reproducible.
      • The generated moves go round-robin into the moveThreads' moveQueues.
        • So with 4 moveThreads, each moveQueue gets 3 moves, so again 12 unforaged moves in play.
        • STRENGTH: Very little concurrency on the moveQueue.
          • But still the same concurrency on the scoreQueue (which is unavoidable really - at least without heavy forager refactors, which is out of scope for now)
      • In a loop, the solver thread continuously goes to the next move thread, takes 1 processed move from its scoreQueue, forages it, and generates a new unprocessed move in its moveQueue.
        • WEAKNESS: if 1 move thread is very slow, the other move threads will block on the scoreQueue.
          • The slowest move thread effectively sets the pace: all other move threads become equally slow.
          • THREAT: a ruinAndRecreate move can be a 100 times slower than a simple ChangeMove.
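
      A hypothetical sketch of proposal B's wiring: one bounded (moveQueue, scoreQueue) pair per move thread, filled and drained round-robin by the solver thread. The blocking take() in the solver loop is exactly the "slowest move thread sets the pace" weakness. All names are illustrative, not OptaPlanner API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch (not OptaPlanner API): proposal B with one bounded
// (moveQueue, scoreQueue) pair per move thread, wired round-robin.
public class ProposalBSketch {

    record MoveTask(int moveIndex) {}
    record MoveResult(int moveIndex, long score) {}

    public static void main(String[] args) throws InterruptedException {
        int moveThreadCount = 4;
        int bufferSize = 3;
        BlockingQueue<MoveTask>[] moveQueues = new BlockingQueue[moveThreadCount];
        BlockingQueue<MoveResult>[] scoreQueues = new BlockingQueue[moveThreadCount];
        for (int t = 0; t < moveThreadCount; t++) {
            moveQueues[t] = new ArrayBlockingQueue<>(bufferSize);
            scoreQueues[t] = new ArrayBlockingQueue<>(bufferSize);
            final int self = t;
            Thread moveThread = new Thread(() -> {
                try {
                    while (true) {
                        MoveTask task = moveQueues[self].take(); // only this thread reads this queue
                        // do move, calculate score, isAccepted(), undo move
                        scoreQueues[self].put(new MoveResult(task.moveIndex(), task.moveIndex() * 31L));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            moveThread.setDaemon(true);
            moveThread.start();
        }

        // Solver thread: round-robin fill (3 moves per queue = 12 in play),
        // then round-robin drain + refill.
        int generatedCount = 0;
        while (generatedCount < moveThreadCount * bufferSize) {
            moveQueues[generatedCount % moveThreadCount].put(new MoveTask(generatedCount));
            generatedCount++;
        }
        for (int foraged = 0; foraged < 40; foraged++) {
            // The WEAKNESS made visible: if this move thread is slow, this take()
            // blocks even when other score queues already hold results.
            MoveResult result = scoreQueues[foraged % moveThreadCount].take();
            moveQueues[generatedCount % moveThreadCount].put(new MoveTask(generatedCount));
            generatedCount++;
            System.out.println("Foraged move " + result.moveIndex());
        }
    }
}
```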

      Proposal C: every move thread has its own selector and a scoreQueue. There is no moveQueue.

      • Each move thread generates its own moves.
        • Its scoreQueue blocks if it would go more than 3 moves in advance, to control the number of moves created in parallel?
          • Is this enough to guarantee reproducibility?
        • STRENGTH: the selector move generation is also parallel
        • WEAKNESS: in IndictmentLocalSearch (prototype not yet merged), the moveThreads must also enable ConstraintMatch tracking.
        • WEAKNESS/THREAT: slowest move thread sets the pace (see proposal B)
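
      A hypothetical sketch of proposal C: there is no moveQueue; each move thread runs its own selector (reduced to a Random here) and its bounded scoreQueue's put() is the only throttle on how far it runs ahead. All names are illustrative, not OptaPlanner API.

```java
import java.util.Random;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch (not OptaPlanner API): proposal C, where each move thread
// generates its own moves and is throttled only by its bounded scoreQueue.
public class ProposalCSketch {

    record MoveResult(int threadIndex, int moveIndex, long score) {}

    public static void main(String[] args) throws InterruptedException {
        int moveThreadCount = 4;
        BlockingQueue<MoveResult>[] scoreQueues = new BlockingQueue[moveThreadCount];
        for (int t = 0; t < moveThreadCount; t++) {
            scoreQueues[t] = new ArrayBlockingQueue<>(3); // at most 3 moves in advance
            final int self = t;
            Thread moveThread = new Thread(() -> {
                Random selectorRandom = new Random(self); // per-thread selector state
                try {
                    for (int moveIndex = 0; ; moveIndex++) {
                        // generate move, do it, calculate score, undo it ...
                        long score = selectorRandom.nextLong(); // stand-in for the real score
                        // Blocks once this thread is already 3 moves ahead.
                        scoreQueues[self].put(new MoveResult(self, moveIndex, score));
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            moveThread.setDaemon(true);
            moveThread.start();
        }

        // Solver thread drains round-robin; again the slowest thread sets the pace.
        for (int foraged = 0; foraged < 40; foraged++) {
            MoveResult result = scoreQueues[foraged % moveThreadCount].take();
            System.out.println("Foraged thread " + result.threadIndex() + " move " + result.moveIndex());
        }
    }
}
```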

      Proposal D: proposal B with work stealing from the moveQueues, but one global scoreQueue with sorting like in proposal A

      Proposal E: proposal B with work stealing from the moveQueues, but stolen work is given back to the move thread it was stolen from, with the sorting done locally there

      • Work stealing
        • Every move thread first poll()'s its own queue. If that is empty, it steals from other queues.
          • If all queues are empty, it take()'s from its own queue, blocking while it is still empty. This can happen if foraging is slow.
          • The steal target steps forward by 1, then 3, then 7, then 15, etc. (1 + 2 = 3, 3 + 4 = 7, 7 + 8 = 15), modulo the number of queues.
            • This means that if move thread 10, 11 and 12 are empty, that:
              • 10 will look to 11 (empty), then 14 (non-empty)
              • 11 will look to 12 (empty), then 15 (non-empty)
              • 12 will look to 13 (non-empty)
      • Giving it back
        • After the stolen move is calculated, it is given back to the original move thread's scoreQueue.
        • That scoreQueue's output is sorted on moveIndex.
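
      A hypothetical sketch of proposal E's stealing loop for a single move thread, reproducing the probe example above (thread 10 finds 11 empty and steals from 14). Names are illustrative, not OptaPlanner API.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch (not OptaPlanner API): proposal E's stealing loop. poll() the own
// moveQueue first, probe the other queues stepping forward by 1, 3, 7, 15, ...
// (modulo the queue count), then fall back to a blocking take() on the own
// queue. The stolen move's result goes back to the owner's scoreQueue,
// which sorts its output on moveIndex.
public class ProposalESketch {

    record MoveTask(int ownerIndex, int moveIndex) {}
    record MoveResult(int moveIndex, long score) {}

    static MoveTask nextTask(int self, BlockingQueue<MoveTask>[] moveQueues)
            throws InterruptedException {
        MoveTask task = moveQueues[self].poll(); // own queue first
        if (task != null) {
            return task;
        }
        // Probe victims: self + 1, then 3 further, then 7 further, ...
        int victim = self;
        for (int step = 1; step < moveQueues.length; step = step * 2 + 1) {
            victim = (victim + step) % moveQueues.length;
            MoveTask stolen = moveQueues[victim].poll();
            if (stolen != null) {
                return stolen;
            }
        }
        // Everything probed was empty (e.g. foraging is slow): block on the own queue.
        return moveQueues[self].take();
    }

    static void processAndGiveBack(MoveTask task, BlockingQueue<MoveResult>[] scoreQueues)
            throws InterruptedException {
        // do move, calculate score, undo move ...
        long score = task.moveIndex() * 31L; // stand-in for the real score
        // Give it back where it was stolen from: the owner's scoreQueue.
        scoreQueues[task.ownerIndex()].put(new MoveResult(task.moveIndex(), score));
    }

    public static void main(String[] args) throws Exception {
        int moveThreadCount = 16;
        BlockingQueue<MoveTask>[] moveQueues = new BlockingQueue[moveThreadCount];
        BlockingQueue<MoveResult>[] scoreQueues = new BlockingQueue[moveThreadCount];
        for (int t = 0; t < moveThreadCount; t++) {
            moveQueues[t] = new ArrayBlockingQueue<>(3);
            scoreQueues[t] = new ArrayBlockingQueue<>(3);
        }
        // Reproduce the example: thread 10's own queue and queue 11 are empty,
        // queue 14 is not, so thread 10 steals from 14.
        moveQueues[14].put(new MoveTask(14, 0));
        MoveTask task = nextTask(10, moveQueues);
        processAndGiveBack(task, scoreQueues);
        System.out.println("Thread 10 stole a move owned by thread " + task.ownerIndex());
    }
}
```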

        Attachments:
          1. Selection_640.png (33 kB)
          2. Selection_641.png (31 kB)
          3. ABbenchmarkResults.ods (67 kB)

              Reporter / Assignee: Geoffrey De Smet (gdesmet@redhat.com)