  Red Hat OpenStack Services on OpenShift / OSPRH-12518

BZ#2252279 [OSP 16] galera replication fails after SST with "[ERROR] WSREP: Failed to apply trx"


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Component: mariadb-operator
    • Team: rhos-ops-platform-services-pidone
    • Severity: Moderate

      Description of problem:

      BMW, which is running its Galera cluster under a Support Exception [0], experienced an incident on two controllers [1], and the cluster was not able to get back in sync afterwards.

      [original case description]
      We've had the galera cluster fall apart and are now trying to recover the two inactive servers from the active one via SST/IST.
      Every time they finish pulling and start applying new wsrep transactions on top, they fail. Logs / pastes follow.

      (will share the log in the #1 comment)

      The customer is now looking for a root cause and has also raised a number of complaints about the recovery strategy they had to use [2].

      Version-Release number of selected component (if applicable):
      RHOSP 16.2 - mysqld 10.3.32-MariaDB

      Additional info:

      [0] Support Exception
      wsrep_sst_method set to mariabackup
      SE: https://issues.redhat.com/browse/SUPPORTEX-11923
      BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2076551
      Case: https://access.redhat.com/support/cases/#/case/03082217
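
      For context, the SST method covered by this Support Exception is selected with the wsrep_sst_method option in the Galera configuration. A minimal way to confirm what a controller is actually using; the config path below is the usual one for a containerized RHOSP 16 deployment and is an assumption, not taken from the case:

        $ sudo grep wsrep_sst_method \
              /var/lib/config-data/puppet-generated/mysql/etc/my.cnf.d/galera.cnf
        wsrep_sst_method = mariabackup   # expected with the SE applied; rsync is the default otherwise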

      [1] [customer's incident summary]
      Short context as to what happened to cause this:

      We have ongoing OVN/OVS issues (ref: #03501413) and we are now starting to see that corosync knet over the default interfaces (br-ex here, so OVS) seems to be unstable enough to cause flapping in pacemaker.
      What exactly causes the ovn-dbs to fall apart is not yet known, but when they do fall apart we end up in a split-brain between the neutron DB in the galera cluster and the OVN Northbound DB in the ovn-dbs cluster. That manifests in port deletes not propagating from neutron to the Northbound DB, so we ended up with 1700+ ports in OVN that were not in neutron. The current way to mitigate that is to nbctl lsp-delete the ports from the ovn-dbs.
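
      [Editorial illustration, not part of the customer's summary: the cleanup described above amounts to something like the following, where the switch, network and port names are placeholders and how ovn-nbctl reaches the Northbound DB depends on the deployment.]

        $ ovn-nbctl lsp-list <logical-switch>        # ports known to the OVN Northbound DB
        $ openstack port list --network <network>    # ports known to neutron, for comparison
        $ ovn-nbctl lsp-del <stale-port>             # remove a port that exists only in OVN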

      After we cleaned that up, we ran sos report --all-logs on all three openstack controller nodes to ensure that we had everything captured to escalate #03501413 further. This caused two of the three controllers to experience kernel hung task timeouts (example kernel messages from one of them will be attached), which in turn caused all four NICs to go down and reset - effectively leaving us with one master and two syncing DB slaves.

      These DB slaves both showed the behaviour initially reported: they were pulling SSTs, then switching to the "normal" live replication and failing while applying the replicated transactions.
      To us that does not make any sense at all, as the only way that SST + replication should be able to fail that way would be if the SST that was transferred was not consistent with respect to the replication journal position it claimed to correspond to. This should not be possible.
      [/customer's summary]
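
      Not from the original case, but relevant to the consistency question raised in the summary above: the cluster state UUID and last committed sequence number can be compared between the donor and a joiner once the joiner is running, for example as below (how the mysql client reaches each node in this containerized deployment is left out):

        $ mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
                    ('wsrep_cluster_state_uuid','wsrep_last_committed');"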

      [2] [customer's concerns about the issues]

      I've had a bunch of galera clusters break over the years and SSTs or ISTs always worked just fine. However, in this case, we continuously got "Internal MariaDB error code: 1032" on various deletes, each time on different tables, and different on each of the two SST/IST target nodes. This continued until we decided to stop all connections to mysql through a temporarily modified clustercheck (having it always return 503 so that haproxy has nothing to send connections to). We then had 0 connections to mysql, did two SSTs to the other nodes, undid the clustercheck modification, and everything was fine again.
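
      As an illustration of the drain trick described above (not necessarily the exact modification the customer made): the health check that haproxy polls can be made to answer 503 unconditionally, for instance by temporarily replacing the clustercheck script body with something like the following, so every Galera backend is marked down and the pool drains.

        #!/bin/bash
        # temporary "always unavailable" clustercheck: haproxy sees 503 on every
        # check and stops routing new connections to any Galera backend
        echo -en "HTTP/1.1 503 Service Unavailable\r\n"
        echo -en "Content-Type: text/plain\r\n"
        echo -en "Connection: close\r\n"
        echo -en "Content-Length: 36\r\n\r\n"
        echo -en "Galera cluster node is not synced.\r\n"

      Restoring the original script brings back the normal state-based 200/503 behaviour.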

      The question is, why did we have to drop all connections to mysql? When an SST happens, the node goes into DONOR state - haproxy then already shuts down all sessions. It does the SST/IST, goes back to SYNCED, and takes connections again. So technically it's the same, but in reality there is something different. To me it looks like either at the start of the SST (entering DONOR) or at the end (entering SYNCED), transactions are happening that are "missed" in the slave replication and/or in the frozen dataset. And this should never happen, regardless of the number of connections open against mysql at those points in time (~5000 tcp ESTABLISHED).
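
      For reference (again an illustration, not from the case), the DONOR/SYNCED transition described above can be watched on the donor while the transfer runs:

        $ watch -n 1 "mysql -e \"SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';\""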

      If this is a limitation of the rsync method - why aren't you using mariabackup instead?

      If there is a race condition - how do we add some pre- and post-delays?

      Or is this a bug in the current mariadb/galera code that has already been fixed?
      [/customer's concerns about the issues]

              Damien Ciabrini (rhn-engineering-dciabrin)
              RH Bugzilla Integration (jira-bugzilla-migration)
              Daniel Barzilay (Inactive)
              rhos-dfg-pidone
