During my work on CLOUD-2261 I found instabilities in the PostgreSQLXARecoveryWithNFSDisconnectLoadTest. I tried to tune the test a bit and investigate, and I would like to summarize my findings. However, I haven't tracked down the root cause or been able to fix it.
What I expect from the test: put a request load on the service, disconnect a pod, restore the network, stop sending requests, give recovery handling time to bring the system to a consistent state, and check the results.
What I think is currently the issue is that requests in progress do not finish at the time the clients (`HttpWorker`) are stopped (https://gitlab.cee.redhat.com/xpaas-qe/xpaas-qe/blob/master/test-eap/src/test/java/com/redhat/xpaas/eap/xa/load/AbstractSQLXARecoveryLoadTest.java#L145). Requests are still being processed for a long time afterwards (minutes, 5+). I haven't found out who is pooling them or why they take so long to process. If the test waits for all requests to be processed (e.g. a 10 min wait) and only then leaves recovery to make the system consistent, the test seems to run fine and stably.
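
One possible way to replace a fixed wait with a bounded wait on outstanding requests is sketched below. This is only an illustration: `InFlightTracker` and its methods are hypothetical and not part of the xpaas-qe code, and it assumes the request handling can be instrumented to report when a request starts and finishes.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical helper (not part of the existing test suite): counts requests
 * that have started but not yet finished, so a test can wait until the
 * service has drained before letting recovery run and checking results.
 */
public class InFlightTracker {
    private final AtomicInteger inFlight = new AtomicInteger();

    /** Call when a request starts being processed. */
    public void requestStarted() {
        inFlight.incrementAndGet();
    }

    /** Call when a request has finished, successfully or not. */
    public void requestFinished() {
        inFlight.decrementAndGet();
    }

    /** Number of requests currently in progress. */
    public int inFlight() {
        return inFlight.get();
    }

    /**
     * Polls until no request is in flight or the timeout elapses.
     * Returns true if drained, false on timeout or interruption.
     */
    public boolean awaitDrained(long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (inFlight.get() > 0) {
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out with requests still in progress
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
                return false;
            }
        }
        return true;
    }
}
```

In the test this would sit between stopping the `HttpWorker` clients and the recovery wait, for example `tracker.awaitDrained(10 * 60 * 1000L, 1000L)`, and the test could fail early with a clear message if the call returns false.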