Currently parallel boot works by executing one task per Extension.initializeParsers call, one per Extension.initialize call, and two per subsystem: one for execution of Stage.MODEL ops and one for Stage.RUNTIME ops. The Extension.initializeParsers tasks complete before boot proceeds to the point where any Extension.initialize tasks run, and the Extension.initialize tasks complete before the Stage.MODEL tasks run. The Stage.MODEL tasks do the bulk of their work before the Stage.RUNTIME tasks run, but they then block waiting for the Stage.RUNTIME tasks and the rest of the boot to complete.
The rough effect of all this is that we allocate 2 threads per subsystem to do parallel boot, and at various points we have 1 thread per subsystem concurrently working. For a brief period (doing Stage.DONE of the post-extension boot op) we have 2 threads per subsystem concurrently working.
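To make the thread accounting above concrete, here is a minimal standalone sketch of the hand-off between the two per-subsystem tasks. It is not WildFly code: the class BootThreadSketch, the method peakLiveThreads and the latch names are hypothetical, and the stage work is reduced to comments. It only illustrates why a Stage.MODEL thread that blocks until boot completes is still alive while the matching Stage.RUNTIME thread runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simulation; none of these names exist in WildFly.
final class BootThreadSketch {

    /** Simulates the MODEL/RUNTIME hand-off and returns the peak number of live task threads. */
    static int peakLiveThreads(int subsystems) throws InterruptedException {
        CountDownLatch modelWorkDone = new CountDownLatch(subsystems);
        CountDownLatch bootDone = new CountDownLatch(1);
        AtomicInteger live = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        // One "Stage.MODEL" thread per subsystem.
        List<Thread> modelThreads = new ArrayList<>();
        for (int i = 0; i < subsystems; i++) {
            Thread t = new Thread(() -> {
                enter(live, peak);
                // (Stage.MODEL work for one subsystem would happen here.)
                modelWorkDone.countDown();
                try {
                    bootDone.await(); // stays alive, blocked, until the rest of boot completes
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
                live.decrementAndGet();
            });
            modelThreads.add(t);
            t.start();
        }
        modelWorkDone.await(); // all MODEL work is done before any RUNTIME task starts

        // One "Stage.RUNTIME" thread per subsystem, running while the MODEL threads are blocked.
        List<Thread> runtimeThreads = new ArrayList<>();
        for (int i = 0; i < subsystems; i++) {
            Thread t = new Thread(() -> {
                enter(live, peak);
                // (Stage.RUNTIME work for one subsystem would happen here.)
                live.decrementAndGet();
            });
            runtimeThreads.add(t);
            t.start();
        }
        for (Thread t : runtimeThreads) {
            t.join();
        }

        bootDone.countDown(); // "boot complete": release the blocked MODEL threads
        for (Thread t : modelThreads) {
            t.join();
        }
        return peak.get();
    }

    /** Records that a thread has started and updates the observed peak. */
    private static void enter(AtomicInteger live, AtomicInteger peak) {
        int now = live.incrementAndGet();
        peak.accumulateAndGet(now, Math::max);
    }
}
```

In this simulation the peak is always above the subsystem count and never above twice the subsystem count, since every blocked MODEL thread is still alive when the RUNTIME threads start. The exact value depends on scheduling.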
My measurements show that all of this concurrent work reduces boot time by about 400ms on my machine, using the full WildFly standalone-full.xml config. However, this approach uses a lot of threads. So the task here is to look into how to get the same or better boot speed while using fewer threads. (Note the threads will expire and be gc'd after boot.)
The obvious way to do this is to look at each of the 4 task types discussed in the first paragraph and group things into larger units of work than a single extension/subsystem.
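One way to picture the grouping is the sketch below, which splits a list of per-extension or per-subsystem work items into a bounded number of chunks and runs each chunk on its own thread. The class CoarseGrainedBoot and its methods are hypothetical illustrations and not part of the WildFly code base. The even, contiguous split is also just an assumption; a real grouping might weight subsystems by expected cost.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper; not an existing WildFly class.
final class CoarseGrainedBoot {

    /** Splits items into at most maxChunks contiguous chunks of near-equal size. */
    static <T> List<List<T>> partition(List<T> items, int maxChunks) {
        if (maxChunks < 1) {
            throw new IllegalArgumentException("maxChunks must be >= 1");
        }
        List<List<T>> chunks = new ArrayList<>();
        int n = items.size();
        int chunkCount = Math.min(maxChunks, n);
        int start = 0;
        for (int i = 0; i < chunkCount; i++) {
            // Spread the remainder over the first (n % chunkCount) chunks.
            int size = n / chunkCount + (i < n % chunkCount ? 1 : 0);
            chunks.add(new ArrayList<>(items.subList(start, start + size)));
            start += size;
        }
        return chunks;
    }

    /** Runs all tasks using one thread per chunk; returns once every chunk has finished. */
    static void runInChunks(List<Runnable> tasks, int maxThreads) throws Exception {
        List<List<Runnable>> chunks = partition(tasks, maxThreads);
        if (chunks.isEmpty()) {
            return; // nothing to run, and a fixed pool of size 0 is not allowed
        }
        ExecutorService pool = Executors.newFixedThreadPool(chunks.size());
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<Runnable> chunk : chunks) {
                // Tasks within a chunk run sequentially on the same thread.
                futures.add(pool.submit(() -> chunk.forEach(Runnable::run)));
            }
            for (Future<?> f : futures) {
                f.get(); // propagates any task failure
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

With, say, 30 subsystems and maxThreads set to 4, this uses 4 threads instead of 30 for one task type, at the cost of running the work inside each chunk serially.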
Initial work on doing this shows that using more coarse-grained chunks does not reduce boot time, but also does not seem to increase it. Further measurement is needed to confirm this, though, and small tweaks may show different results.
Another thing to consider is allowing the Stage.MODEL tasks to complete without waiting for the overall boot op to complete. This might reduce the max number of threads involved and perhaps will allow a tiny bit more parallelization of work. The key here is ensuring the Stage.MODEL tasks are not able to affect the state of the final system in an invalid way. That could be problematic or fragile, so it's just something to consider, and if done must be done with great care.
Even if this work produces no reduction in boot time, there is some value in incorporating it as long as it produces no increase, since avoiding unnecessary thread creation makes the software look more efficient and better designed.
In particular, with a default thread stack size of 1024K, allocating an extra 50+ threads at boot means the process will consume roughly an extra 50MB of RSS beyond what it would otherwise need. That memory should eventually be returned to the OS, and it's possible that later use of the server will result in a peak memory use after boot that's higher than what's needed at boot. Still, in a memory-constrained environment (think cloud, with applications trying to live in a smaller memory budget), requesting an extra 50MB beyond what provides benefit is not immaterial.
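The arithmetic behind the 50MB figure can be written out as below. The class and method names are hypothetical, and the calculation follows the text in treating each thread's full stack as consumed. In practice this is an upper bound, since how much of each stack actually becomes resident depends on the OS and on how deep the stack grows.

```java
// Hypothetical helper for the estimate above; not part of WildFly.
final class BootThreadStackEstimate {

    /** Upper bound, in MiB, on the stack memory reserved for the given number of threads. */
    static long reservedStackMiB(int threadCount, long stackSizeKiB) {
        return threadCount * stackSizeKiB / 1024;
    }

    public static void main(String[] args) {
        // 50 extra boot threads at a 1024K stack size each.
        System.out.println(reservedStackMiB(50, 1024) + " MiB"); // prints "50 MiB"
    }
}
```

Halving the number of boot threads halves this bound, which is the memory-side argument for coarser task grouping even when boot time is unchanged.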