-
Enhancement
-
Resolution: Done
-
Major
-
4.15.0, 4.6.1.CP12
-
None
The current XAResource recovery algorithm requires a suspect orphan Xid to appear in two recovery scans before it is eligible for rollback under presumed abort. This closes the timing window in which a tx branch has been prepared for normal termination but not yet committed. It relies on the assumption that the interval between scans is long enough for any normally executing tx to proceed from prepare to commit. Where the scans occur as part of consecutive recovery passes, this is generally the case.
However...
There are two cases in which scans can happen in quick succession, nullifying the safeguard and causing the recovery system to incorrectly presume abort on a branch. The first is caused by top-down recovery running before bottom-up recovery in the same pass: if a tx log contains an xaresourcerecord, its instantiation may cause a recovery scan. The second, and more common, case is where the user has incorrectly registered two or more recovery resources for the same RM. At first glance it seems possible to add a check to prevent this misconfiguration, but in practice it's not possible to write a robust comparison method that will work with all known RMs. Since it is infeasible to prevent multiple scans occurring in quick succession, it is instead necessary to change the safeguard algorithm.
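The failure mode can be illustrated with a minimal sketch of a scan-count safeguard. ScanCountTracker and its methods are hypothetical names for illustration only and are not the existing RecoveryXids code; Xids are represented by any key type with value-based equals/hashCode.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of a "seen in two scans" rule; not the real RecoveryXids.
final class ScanCountTracker<X> {
    // Number of scans in which each Xid has been reported so far.
    private final Map<X, Integer> sightings = new HashMap<>();

    /** Record the Xids returned by one recovery scan of the RM. */
    void recordScan(Set<X> xidsInScan) {
        // Forget Xids the RM no longer reports (they committed or rolled back).
        sightings.keySet().retainAll(xidsInScan);
        for (X xid : xidsInScan) {
            sightings.merge(xid, 1, Integer::sum);
        }
    }

    /** Eligible for presumed abort once seen in at least two scans. */
    boolean isEligibleForRollback(X xid) {
        return sightings.getOrDefault(xid, 0) >= 2;
    }
}
```

Calling recordScan twice in immediate succession with the same Xid, as would happen with a duplicate recovery resource registration, makes the Xid eligible even though no meaningful time has passed since prepare.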
We should require a set interval to pass after the first sighting of an Xid before considering it eligible for presumed abort, rather than requiring a given number of scan passes. This will require modification of XARecoveryModule.xaRecovery and RecoveryXids. An additional safeguard may be created by supplementing the existing transaction-log-based XAResourceOrphanFilter with one based on TransactionImple.getTransaction, at least for configurations where the recovery manager is running in-process.
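A minimal sketch of the interval-based rule might look like the following. FirstSightingTracker, safetyIntervalMillis and the explicit nowMillis parameter are assumptions made for illustration; the actual change to XARecoveryModule.xaRecovery and RecoveryXids may be structured differently.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of a time-based safeguard; not the real RecoveryXids.
final class FirstSightingTracker<X> {
    // Minimum age (since first sighting) before an Xid may be presumed aborted.
    private final long safetyIntervalMillis;
    // Timestamp at which each Xid was first reported by a scan.
    private final Map<X, Long> firstSeenAt = new HashMap<>();

    FirstSightingTracker(long safetyIntervalMillis) {
        this.safetyIntervalMillis = safetyIntervalMillis;
    }

    /** Record the Xids returned by one recovery scan taken at nowMillis. */
    void recordScan(Set<X> xidsInScan, long nowMillis) {
        // Forget Xids the RM no longer reports (they committed or rolled back).
        firstSeenAt.keySet().retainAll(xidsInScan);
        for (X xid : xidsInScan) {
            // Keep the earliest timestamp; repeat scans do not reset or advance it.
            firstSeenAt.putIfAbsent(xid, nowMillis);
        }
    }

    /** Eligible only once the safety interval has elapsed since first sighting. */
    boolean isEligibleForRollback(X xid, long nowMillis) {
        Long first = firstSeenAt.get(xid);
        return first != null && nowMillis - first >= safetyIntervalMillis;
    }
}
```

With this rule, the number of scans is irrelevant: any number of scans inside the interval leaves the Xid ineligible, so duplicate recovery resources or a top-down scan in the same pass cannot shorten the window.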
- relates to
-
JBTM-924 recovery leaks xaresources
- Closed