When server-to-server remote EJB calls propagate the transaction context, the EJB call can be routed to one pod while the later recovery call is directed to a different pod.
This causes a consistency issue.
Consider this scenario: the first server (let's call it `tx-client`) makes a remote EJB call to a remote server, which is one of the servers joined in a cluster, named `tx-server-0` and `tx-server-1`. `tx-client` calls `tx-server-1`. Processing continues up to the start of the 2PC, and then `tx-server-1` crashes (or the host goes down, a network issue happens...).
`tx-client` detects that the processing was not successful and asks the recovery manager to retry and finish it.
The recovery manager starts to call the remote server based on data saved in the object store of `tx-client`.
Unfortunately, the remote recovery call goes not to `tx-server-1` but to `tx-server-0`. `tx-client` gets the error code `XAException.XAER_NOTA` (`-4`), removes the data from its object store (`/opt/eap/standalone/data/tx-object-store/`, `/opt/eap/standalone/data/ejb-xa-recovery`), and then never finishes the in-doubt transactions at `tx-server-1`.
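To make the failure mode concrete, here is a minimal, self-contained simulation of the scenario above. It is only a sketch: `Pod`, `Client`, `recover`, and the transaction id `tx-1` are hypothetical names invented for illustration, and the code does not reproduce the actual WildFly, WFTC, or Narayana recovery logic. The one behaviour it assumes, taken from the description above, is that the client drops its record when it receives `XAER_NOTA`.

```java
import java.util.HashSet;
import java.util.Set;
import javax.transaction.xa.XAException;
import javax.transaction.xa.XAResource;

public class RecoveryRoutingSketch {

    /** Hypothetical stand-in for one clustered pod and the in-doubt transactions it holds. */
    public static final class Pod {
        public final String name;
        public final Set<String> inDoubt = new HashSet<>();

        public Pod(String name) {
            this.name = name;
        }

        /** Completes the transaction if this pod knows it, otherwise reports XAER_NOTA. */
        public int commit(String txId) {
            return inDoubt.remove(txId) ? XAResource.XA_OK : XAException.XAER_NOTA;
        }
    }

    /** Hypothetical stand-in for tx-client and its object store. */
    public static final class Client {
        public final Set<String> objectStore = new HashSet<>();

        /** One recovery attempt against whichever pod the call happened to be routed to. */
        public int recover(String txId, Pod routedTo) {
            int rc = routedTo.commit(txId);
            // Assumption mirroring the report: XAER_NOTA is read as "nothing left to do",
            // so the record is dropped just as it would be after a successful commit.
            if (rc == XAResource.XA_OK || rc == XAException.XAER_NOTA) {
                objectStore.remove(txId);
            }
            return rc;
        }
    }

    public static void main(String[] args) {
        Pod server0 = new Pod("tx-server-0");
        Pod server1 = new Pod("tx-server-1");
        Client client = new Client();

        // tx-server-1 holds the in-doubt transaction; tx-client kept a record of it.
        server1.inDoubt.add("tx-1");
        client.objectStore.add("tx-1");

        // The recovery call is routed to the other pod.
        int rc = client.recover("tx-1", server0);

        System.out.println("return code: " + rc);
        System.out.println("client still tracks tx-1: " + client.objectStore.contains("tx-1"));
        System.out.println("tx-server-1 still in doubt: " + server1.inDoubt.contains("tx-1"));
    }
}
```

In this sketch the client ends up with an empty object store while `tx-server-1` still holds `tx-1`, which is the inconsistency described above. Routing the recovery call to `server1` instead would complete the transaction on both sides.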
It is unclear whether this is an issue of the OpenShift configuration or a problem in the WFTC/EJB/remoting layer in WildFly.
This was tested with WFLY Operator from 2019-09-26 `@90a2b3b`.
- relates to WFWIP-201 incomplete tx recovery on openshift (Resolved)