Our full build and test suite takes 12hrs to run. This is causing severe strain on our CI resources.
Out of this 12hrs, 10hrs is consumed by the QA test suite (5hrs per orb). It would be good to understand where this time is spent so that we can hopefully bring the test time down.
If there is no reasonable way to bring the timing down, we could chop the tests up into 5 groups (per orb) taking 1hr each. This would make better use of our CI cluster during quiet periods, but will not eleviate the load during busy periods. It is also likely to be a poor usage of resources as I suspect the CPU utilisation of each job will be low.
I suspect the problem is that we have a lot of Thread.sleeps in the tests. Therefore we are likely to see long tests with low resource (CPU) utilisation. However, further investigation is required to understand where the time is spent.
As an initial investigation, it would be good to profile the CPU utilisation on one of the CI nodes (take it offline) whilst running these tests. Also, it would be handy to find out how much time is spent in a Thread.sleep call (assuming that is where the time is spent).
My understanding is that Thread.sleep is used to wait for certain background tasks to complete. It's likely that Byteman could be used to replace the sleeps. However, due to the number of tests this may not be feasible. Alternatively, we may discover that there is a common set of Byteman rules that could be applied to the majority of tests, which may prove "good enough".
Also, we may be able to run these tests in parallel. This could drive up CPU utilisation, but would probably also interfere with the timings of the tests meaning that the current thread.sleeps are no longer adequate.
To summarise, this is a general task to investigate why the tests take so long and to also come up with some candidate solutions to bring the total test duration down.