Uploaded image for project: 'JBoss Web Server'
  1. JBoss Web Server
  2. JWS-2159

[ASFBZ 65454] Race condition with idle thread removal results in request being improperly enqueued by default ThreadPoolExecutor

XMLWordPrintable

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Major Major
    • 5.7.0.ER2
    • 5.5.0.GA
    • tomcat
    • None
    • False
    • False
    • +
    • Undefined
    • Workaround Exists
    • Hide

      Set minSpareThreads equal to maxThreads to avoid any idle thread removal allowing for the improperly queued request race condition

      Show
      Set minSpareThreads equal to maxThreads to avoid any idle thread removal allowing for the improperly queued request race condition
    • Hide

      This can be hard to reproduce or see in typical operations, but I was able to reproduce deterministically with some help from byteman. To use byteman and reproduce:

      1. Download the byteman 4.0.13 from https://downloads.jboss.org/byteman/4.0.13/byteman-download-4.0.13-bin.zip
      2. Unzip byteman into install directory <mypath>.
      3. Use the Byteman rules file DecrementWorkerSleep.btm from WorkerQueueRaceCondition.zip
      4. Enable the Byteman instrumentation by adding the following to your JVM options:

      -javaagent:<mypath>/lib/byteman.jar=script:<mypath>/DecrementWorkerSleep.btm,boot:<mypath>/lib/byteman.jar -Dorg.jboss.byteman.transform.all=true
      

      5. Add the helloworld from WorkerQueueRaceCondition.zip to webapps for a simple slow jsp that can be easily used to trigger long requests
      7. Start Tomcat
      8. Make 10 slow requests (/helloworld/hi.jsp?sleep=true), resulting in usage of the 10 core threads
      9. Make an 11th normal request (/helloworld/hi.jsp), resulting in the creation of the 11th non-core thread
      10. Wait ~60 seconds after that to result in the idle timeout of that 11th thread. With the extra byteman delay induced in that operation, you can watch for a message like this:

      --------------------------->ThreadPoolExecutor.compareAndDecrementWorkerCount http-nio-8080-exec-11 sleeping...
      

      11. Upon seeing that message, issue another normal quick request like in step 9 and note it is not getting a response. And without any other activity on the server, the request is never processed until one of the 10 core threads may become available. A heap dump at that time and inspecting the executor/taskQueue confirms only the 10 busy threads in the pool with slow requests while the expected quick request is sitting in the queue doing nothing.

      Show
      This can be hard to reproduce or see in typical operations, but I was able to reproduce deterministically with some help from byteman. To use byteman and reproduce: 1. Download the byteman 4.0.13 from https://downloads.jboss.org/byteman/4.0.13/byteman-download-4.0.13-bin.zip 2. Unzip byteman into install directory <mypath>. 3. Use the Byteman rules file DecrementWorkerSleep.btm from WorkerQueueRaceCondition.zip 4. Enable the Byteman instrumentation by adding the following to your JVM options: -javaagent:<mypath>/lib/byteman.jar=script:<mypath>/DecrementWorkerSleep.btm,boot:<mypath>/lib/byteman.jar -Dorg.jboss.byteman.transform.all= true 5. Add the helloworld from WorkerQueueRaceCondition.zip to webapps for a simple slow jsp that can be easily used to trigger long requests 7. Start Tomcat 8. Make 10 slow requests (/helloworld/hi.jsp?sleep=true), resulting in usage of the 10 core threads 9. Make an 11th normal request (/helloworld/hi.jsp), resulting in the creation of the 11th non-core thread 10. Wait ~60 seconds after that to result in the idle timeout of that 11th thread. With the extra byteman delay induced in that operation, you can watch for a message like this: --------------------------->ThreadPoolExecutor.compareAndDecrementWorkerCount http-nio-8080-exec-11 sleeping... 11. Upon seeing that message, issue another normal quick request like in step 9 and note it is not getting a response. And without any other activity on the server, the request is never processed until one of the 10 core threads may become available. A heap dump at that time and inspecting the executor/taskQueue confirms only the 10 busy threads in the pool with slow requests while the expected quick request is sitting in the queue doing nothing.

      I've found a race condition that can play out like below any time a non-core thread above minSpareThreads is idling out. This can result in a request to be improperly enqueued like so:

      1. The thread pool has 11 total threads, 10 threads are busy and 1 idle. After the idle timeout (default 60 seconds), the idle thread pool member ends its ThreadPoolExecutor.getTask()/workQueue.poll() wait on the task queue. It hasn't yet called ThreadPoolExecutor.compareAndDecrementWorkerCount()/ThreadPoolExecutor.processWorkerExit() to decrement the worker count and remove itself from the executor worker map.

      2. A new connection comes in and is handed off to the executor, which calls TaskQueue.offer. Since the idle thread hasn't removed itself from the map, parent.getPoolSize() still returns 11. Thus, it passes this if check and so it is enqueued:

      https://github.com/apache/tomcat/blob/9.0.x/java/org/apache/tomcat/util/threads/TaskQueue.java#L87

      3. The idle thread then finishes exiting and removes itself from the executor. The executor does not inherently replace that thread under this condition. So there are now 10 busy threads in the pool and no idle thread available to process the new enqueued request so it sits in the queue until one of the other threads becomes idle. This could then typically be a small imperceptible delay by the time another thread is created or returns idle. But worst case, a very large unexpected delay is induced on the new request depending upon the run time of the current 10 busy tasks.

              rmaucher Remy Maucherat
              rhn-support-aogburn Aaron Ogburn
              Santiago Gala Santiago Gala
              Votes:
              0 Vote for this issue
              Watchers:
              8 Start watching this issue

                Created:
                Updated:
                Resolved: