Details
Bug
Resolution: Unresolved
Critical
Description
Context
When EAP is used in K8s (together with its operator), the right way to trigger the transaction recovery is to decrease the number of replicas defined in the Custom Resource (CR) representing the EAP server. In this way, the operator guarantees that a pod (or multiple pods if the user decides to scale down more than one server) will be scaled down only when there are no transactions left in the object store.
Reproducer
AFAIK, there is a scenario in which the operator does not guarantee that the pod(s) being scaled down are removed only after the recovery of transactions has completed.
Steps to reproduce this scenario:
- Initiate the scale-down of an EAP pod that has in-doubt transactions (basically, set the replicas value of the CR to 0)
- Modify the value of StatefulSet’s replicas to match the value defined in the CR
- Result -> the operator is not able to recreate the EAP pod and continue the recovery of the in-doubt transactions
- [NB: If `oc delete pod tx-*` is executed while the operator is waiting for the Object Store to become empty, the StatefulSet guarantees that the pod will be recreated; this case is covered: the operator restarts the transaction recovery on the new pod]
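The steps above can be sketched as a small script. This is only an illustration under stated assumptions: the resource name `tx-server` is hypothetical, and the sketch assumes that the CR kind is `wildflyserver`, that its replica count lives at `spec.replicas`, and that the StatefulSet has the same name as the CR. By default the script only prints the commands (dry run); set `DRY_RUN=0` to execute them against a cluster.

```shell
#!/bin/sh
# Sketch of the reproducer. NAME is a hypothetical resource name; the CR kind
# "wildflyserver" and a StatefulSet named like the CR are assumptions.
NAME="${NAME:-tx-server}"
# With DRY_RUN=1 (the default) commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

reproduce() {
  # Step 1: request the scale-down through the CR, so that the operator
  # starts waiting for the in-doubt transactions to be recovered.
  run oc patch wildflyserver "$NAME" --type=merge -p '{"spec":{"replicas":0}}'
  # Step 2: while the operator is still waiting, change the StatefulSet
  # directly so that its replicas match the value in the CR.
  run oc scale statefulset "$NAME" --replicas=0
  # Step 3: list the pods to check whether the EAP pod is still present.
  run oc get pods
}

reproduce
```

If the scenario described in this ticket applies, the last command should show that the EAP pod is gone even though the object store was not yet empty.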
The purpose of this ticket
The documentation of the EAP Operator explains the right procedure to make sure that transaction recovery is carried out. Nevertheless, this note should be modified:
Decreasing the replica size of the StatefulSet or deleting the pod itself has no effect and such changes are reverted.
In fact, when the StatefulSet is modified while the operator is controlling the scale-down of an EAP pod, the existence of the pod being scaled down is not guaranteed. I propose to replace the note with something like this:
Deleting the pod itself has no effect and such changes are reverted. Decreasing the replica size of the StatefulSet also has no effect and such changes are reverted. Nevertheless, there is a corner case to consider: when the replica size of the StatefulSet is decreased after the Operator has started the (artificial) scaling down of a pod belonging to the StatefulSet, this modification stops the transaction recovery immediately because the pod is removed abruptly.