- Bug
- Resolution: Unresolved
- Normal
- None
- rhos-16.2.z
- None
- 13
- False
- False
- ?
- rhos-ops-platform-services-pidone
- None
- Moderate
Galera nodes sometimes fail to fully synchronize after joining the cluster.
This is a follow-up to https://issues.redhat.com/browse/OSPRH-12518
This is another attempt at coming up with a conclusive RCA for the issue
reported in OSPRH-12518: in a heavily loaded environment, a Galera node
joining an existing cluster would correctly integrate the data it
received from the SST it requested over rsync, but then fail to behave
correctly after integrating the remaining write sets over IST to catch
up with the state of the cluster.
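When trying to reproduce this, it may help to have a quick check of whether a joiner reports itself as synchronized once SST and IST have completed. The sketch below is only an illustration: it assumes the wsrep status has been dumped as tab-separated `name<TAB>value` lines (for example from a batch-mode `SHOW GLOBAL STATUS LIKE 'wsrep_%'` query), and the helper names `parse_wsrep_status` and `node_is_synced` are hypothetical. It is also an assumption that the failure described here would be visible in these status variables at all.

```python
def parse_wsrep_status(text):
    """Parse tab-separated 'name<TAB>value' lines into a dict.

    Assumption: the input is a batch-mode dump of the wsrep status
    variables, one variable per line.
    """
    status = {}
    for line in text.splitlines():
        parts = line.strip().split("\t", 1)
        if len(parts) == 2:
            status[parts[0]] = parts[1]
    return status


def node_is_synced(status):
    """Return True if the node reports itself as a synced, ready member
    of the primary component."""
    return (
        status.get("wsrep_local_state_comment") == "Synced"
        and status.get("wsrep_cluster_status") == "Primary"
        and status.get("wsrep_ready") == "ON"
    )
```

A node that stays in another state (such as `Joined`) after the state transfer would make `node_is_synced` return `False`, which could be polled during a reproduction run.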
As additional information, although SST can be implemented with
rsync or mariabackup, there is currently no evidence that the latter
would prevent this problem from occurring. Moreover, we have witnessed
a handful of cases where the error message "WSREP: Failed to apply trx"
reported in this Jira was seen in other environments. So the goal
of this Jira is to track the similarities and hopefully come up
with a definitive explanation as to why this problem can occur,
and whether it is still an issue in OSP 17 and beyond.
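To compare occurrences across environments, the database logs can be scanned for the error string quoted above. The following is a minimal sketch, not an established tool: the function names and the `logs_by_env` mapping are hypothetical, and it only matches the literal message "WSREP: Failed to apply trx" without assuming anything else about the log format.

```python
ERROR_MARKER = "WSREP: Failed to apply trx"


def find_apply_failures(lines):
    """Return (line_number, text) pairs for lines containing the marker."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if ERROR_MARKER in line
    ]


def summarize(logs_by_env):
    """Count matching lines per environment.

    logs_by_env is a hypothetical mapping of an environment label to an
    iterable of log lines (for example an open file object).
    """
    return {
        env: len(find_apply_failures(lines))
        for env, lines in logs_by_env.items()
    }
```

For a single file, `find_apply_failures(open(path))` would return the line numbers to inspect alongside the surrounding SST and IST messages.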
For reference, after an initial discussion, we could not determine
from the logs of the original case that the rsync SST misbehaved in
any way. While this is not proof, we think the issue may lie
in the way the IST is integrated after the SST.
This Jira is a tracker to log the actions needed to try to
reproduce the issue under the right environment conditions.
- relates to
- OSPRH-12518 BZ#2252279 [OSP 16] galera replication fails after SST with "[ERROR] WSREP: Failed to apply trx"
- Closed