- Feature Request
- Resolution: Unresolved
- Major
Current Issue
Currently, WildFly’s graceful shutdown does not take into account that hosting entities can be shut down permanently, i.e. there will be no restart/resume after shutdown/suspension. From the point of view of the Transactions subsystem, this scenario can develop into a data integrity issue: Narayana’s Recovery Coordinator does not necessarily resolve all in-doubt transactions during its suspension, and leaves the resolution of left-over transactions for later, when the Recovery Coordinator is resumed.
The main purpose of the Recovery Coordinator is to recover transactions through a two-phase cycle. Narayana’s behaviour is premised on the assumption that the Recovery Coordinator will be suspended and then will eventually be resumed. During suspension, the Recovery Coordinator is instructed to attempt one extra recovery cycle before leaving all in-doubt transactions in the Object Store. When the Recovery Coordinator is resumed, the recovery routine starts again and it will try to take care of all in-doubt transactions found in the Object Store.
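To make the suspend/resume assumption concrete, here is a toy model of the behaviour described above. It is only a sketch: `RecoveryCycleModel`, `canResolve` and the transaction ids are hypothetical names, the two-phase recovery cycle is collapsed into a single pass, and none of it uses or mirrors Narayana’s actual classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

/** Toy model only; it does not use or mirror Narayana's classes. */
final class RecoveryCycleModel {
    // Ids of the in-doubt transactions currently held in the (modelled) Object Store.
    private final List<String> objectStore = new ArrayList<>();
    // Hypothetical probe: can this transaction be resolved right now?
    private final Predicate<String> canResolve;

    RecoveryCycleModel(List<String> inDoubt, Predicate<String> canResolve) {
        this.objectStore.addAll(inDoubt);
        this.canResolve = canResolve;
    }

    /** One recovery pass: resolves what it can and returns how many entries remain. */
    int runPass() {
        objectStore.removeIf(canResolve);
        return objectStore.size();
    }

    /** Suspension: one extra pass, then whatever is left stays in the store. */
    int suspend() {
        return runPass();
    }

    /** Resumption: recovery starts again on the left-over entries. */
    int resume() {
        return runPass();
    }
}
```

In this model, any entry still in the store after `suspend()` is only dealt with if `resume()` is eventually called, which is exactly the assumption that breaks when the host is never restarted.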
It is the responsibility of the third party integrating Narayana to decide how and when the Recovery Coordinator should be suspended. What is missing in Narayana is a coherent API that can be used to discover the number of in-flight and in-doubt transactions that are still to be completed. When it comes to WildFly and its graceful shutdown (AKA suspension), the ServerActivity implementation (SAI) of the Transactions subsystem is suspended by directly calling the suspension hook of Narayana’s Recovery Coordinator, without considering whether the transaction recovery ability is still needed (e.g. by other SAIs or customers’ applications running in the server). As discussed above, this behaviour is fine in all cases where WildFly is resumed/restarted after suspension/shutdown, but a data integrity issue can arise when WildFly is shut down permanently.
In cloud environments, when hosting entities (e.g. containers, pods, virtual machines, etc.) are scaled down, their state is erased, e.g. their file system, IP address, and memory are deleted. This is exactly the situation where WildFly cannot guarantee that all transactions will be completed and a data integrity issue might occur.
To address this issue, the following points should be developed:
- The Transactions SAI should become aware of how many in-doubt transactions there are in the Object Store and delay its suspension until all of them are completed. If some of the in-doubt transactions fail to complete, the Transactions SAI should keep waiting until those transactions are resolved. This would guarantee data integrity, especially in situations where a negative timeout (i.e. indefinite waiting) is employed (e.g. in cloud environments).
- With regard to the suspension logic of the EJB SAI (reference), the suspension handler should retain control (i.e. it should not return) as long as there are in-doubt transactions to be completed. This would guarantee data integrity, especially in situations where a negative timeout (i.e. indefinite waiting) is employed (e.g. in cloud environments).
- Users should be notified when SAIs are delaying WildFly’s suspension. This is especially important when a negative timeout is used (i.e. indefinite graceful shutdown).
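The three points above could be sketched roughly as follows. This is a minimal illustration, not a proposal for the actual implementation: `InDoubtAwareSuspension`, `awaitRecovery`, the `inDoubtCount` probe and the `notifier` callback are all hypothetical, and no such API exists in WildFly or Narayana today (the missing counter is precisely what this issue asks for).

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.IntSupplier;

/** Hypothetical sketch: none of these names are WildFly or Narayana APIs. */
final class InDoubtAwareSuspension {

    /**
     * Blocks until the in-doubt counter reaches zero or the timeout expires.
     *
     * @param inDoubtCount  hypothetical probe for in-doubt transactions left in the Object Store
     * @param timeoutMillis maximum wait; a negative value means wait indefinitely
     * @param pollMillis    pause between two probes
     * @param notifier      receives a message each time suspension is delayed
     * @return true if no in-doubt transactions remain, false on timeout or interruption
     */
    static boolean awaitRecovery(IntSupplier inDoubtCount, long timeoutMillis,
                                 long pollMillis, Consumer<String> notifier) {
        long deadline = System.nanoTime()
                + TimeUnit.MILLISECONDS.toNanos(Math.max(timeoutMillis, 0));
        while (true) {
            int remaining = inDoubtCount.getAsInt();
            if (remaining == 0) {
                return true; // nothing left in doubt: safe to suspend
            }
            if (timeoutMillis >= 0 && System.nanoTime() - deadline >= 0) {
                return false; // bounded wait exhausted with transactions still in doubt
            }
            // Tell the user why suspension has not completed yet.
            notifier.accept("Suspension delayed: " + remaining
                    + " in-doubt transaction(s) still pending");
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
    }
}
```

With a negative timeout the call only returns once the probe reports zero, which is the behaviour a scale-down in a cloud environment would rely on; with a non-negative timeout the caller still has to decide what to do when `false` comes back.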
Moreover, some nice-to-have points:
- Introducing an asynchronous graceful shutdown would enable all SAIs to receive a pre-suspend/suspend signal concurrently. In this way, the timeout will have the same duration for all SAIs.
- The sequence in which SAIs are suspended during WildFly’s graceful shutdown should be Last In, First Out (LIFO), i.e. the last SAI that was loaded during startup should be the first SAI to get suspended.
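The two nice-to-have points can be illustrated with a small sketch. It assumes a hypothetical `Activity` interface and a hypothetical `SuspendOrderingSketch` class; it is not the WildFly SuspendController and does not reflect how ServerActivity instances are actually registered.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch; not the WildFly SuspendController. */
final class SuspendOrderingSketch {

    /** Hypothetical stand-in for a server activity. */
    interface Activity {
        String name();
        void suspend();
    }

    /** Convenience factory for the sketch. */
    static Activity activity(String name, Runnable body) {
        return new Activity() {
            public String name() { return name; }
            public void suspend() { body.run(); }
        };
    }

    // push() adds at the head, so iteration goes from the newest to the oldest activity.
    private final Deque<Activity> loaded = new ArrayDeque<>();

    void register(Activity activity) {
        loaded.push(activity);
    }

    /** Sequential LIFO suspension; returns the names in the order they were suspended. */
    List<String> suspendLifo() {
        List<String> order = new ArrayList<>();
        for (Activity a : loaded) {
            a.suspend();
            order.add(a.name());
        }
        return order;
    }

    /** Concurrent suspension: every activity starts at once and shares one timeout. */
    boolean suspendConcurrently(long timeoutMillis) {
        CompletableFuture<?>[] futures = loaded.stream()
                .map(a -> CompletableFuture.runAsync(a::suspend))
                .toArray(CompletableFuture[]::new);
        try {
            CompletableFuture.allOf(futures).get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException | TimeoutException e) {
            return false; // an activity failed or did not finish within the shared timeout
        }
    }
}
```

In the sequential variant each activity can eat into the time left for the ones after it, whereas in the concurrent variant the full timeout applies to every activity equally, which is the motivation given in the first point.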
- is blocked by
  - JBTM-3893 Definition in progress: Implement the Recovery Modules which should block the Recovery Manager from suspending (Pull Request Sent)
  - JBTM-3894 Introduce an API and ability to block suspension of the Recovery Manager until a compatible RecoveryModule has completed it's work (Closed)
  - WFCORE-6739 Add ServerActivity ordering to SuspendController execution (Resolved)
- relates to
  - WFLY-18176 WildFly Readiness probe should check the suspended state of the server (Closed)