  Red Hat OpenStack Services on OpenShift / OSPRH-12518

BZ#2252279 [OSP 16] galera replication fails after SST with "[ERROR] WSREP: Failed to apply trx"


    • Type: Bug
    • Resolution: Done
    • Priority: Major
    • Component: mariadb-operator
    • Team: rhos-ops-platform-services-pidone
    • Severity: Moderate

      Description of problem:

      BMW, which is running its Galera cluster under a Support Exception [0], experienced an incident on two controllers [1], and the cluster was not able to get back in sync afterwards.

      [original case description]
      We've had the galera cluster fall apart and are now trying to recover the two inactive servers from the active one via SST/IST.
      Every time they finish pulling and start applying new wsrep transactions on top, they fail. Logs / pastes follow.

      (will share the log in the #1 comment)

      The customer is now looking for a root cause and has also raised a number of complaints about the recovery strategy they had to use [2].

      Version-Release number of selected component (if applicable):
      RHOSP 16.2 - mysqld 10.3.32-MariaDB

      Additional info:

      [0] Support Exception
      wsrep_sst_method set to mariabackup
      SE: https://issues.redhat.com/browse/SUPPORTEX-11923
      BZ: https://bugzilla.redhat.com/show_bug.cgi?id=2076551
      Case: https://access.redhat.com/support/cases/#/case/03082217
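
      For context, the SST method covered by this Support Exception is selected with the wsrep_sst_method option in the Galera configuration. A minimal way to confirm what a controller is actually using; the config path below is the usual one for a containerized RHOSP 16 deployment and is an assumption, not taken from the case:

        $ sudo grep wsrep_sst_method \
              /var/lib/config-data/puppet-generated/mysql/etc/my.cnf.d/galera.cnf
        wsrep_sst_method = mariabackup   # expected with the SE applied; rsync is the default otherwise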

      [1] [customer's incident summary]
      Short context as to what happened to cause this:

      We have ongoing OVN/OVS issues (ref: #03501413) and we are now starting to see that corosync knet over the default interfaces (br-ex here, so OVS) seems to be unstable enough to cause flapping in pacemaker.
      What exactly causes the ovn-dbs to fall apart is not yet known, but when they do fall apart we end up in a split-brain between the neutron DB in the galera cluster and the OVN Northbound DB in the ovn-dbs cluster. That manifests in port deletes not propagating from neutron to the Northbound DB, so we ended up with 1700+ ports in OVN that were not in neutron. The current way to mitigate that is to nbctl lsp-delete the ports from the ovn-dbs.
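
      [Editorial illustration, not part of the customer's summary: the cleanup described above amounts to something like the following, where the switch, network and port names are placeholders and how ovn-nbctl reaches the Northbound DB depends on the deployment.]

        $ ovn-nbctl lsp-list <logical-switch>        # ports known to the OVN Northbound DB
        $ openstack port list --network <network>    # ports known to neutron, for comparison
        $ ovn-nbctl lsp-del <stale-port>             # remove a port that exists only in OVN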

      After we cleaned that up, we ran sos report --all-logs on all three openstack controller nodes to ensure that we had everything captured to escalate #03501413 further. This caused two of the three controllers to experience kernel hung task timeouts (example kernel messages from one of them will be attached), which in turn caused all four NICs to go down and reset - effectively leaving us with one master and two syncing DB slaves.

      These DB slaves both showed the behaviour initially reported: they were pulling SSTs, then switching to the "normal" live replication and failing while applying the replicated transactions.
      To us that does not make any sense at all, as the only way that SST + replication should be able to fail that way would be if the SST that was transferred was not consistent with respect to the replication journal position it claimed to correspond to. This should not be possible.
      [/customer's summary]
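
      Not from the original case, but relevant to the consistency question raised in the summary above: the cluster state UUID and last committed sequence number can be compared between the donor and a joiner once the joiner is running, for example as below (how the mysql client reaches each node in this containerized deployment is left out):

        $ mysql -e "SHOW GLOBAL STATUS WHERE Variable_name IN
                    ('wsrep_cluster_state_uuid','wsrep_last_committed');"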

      [2] [customer's concerns about the issues]

      I've had a bunch of galera clusters break over the years and SSTs or ISTs always worked just fine. However, in this case, we continuously got "Internal MariaDB error code: 1032" on various deletes, each time on different tables, and different on each of the two SST/IST target nodes. This continued until we decided to stop all connections to mysql through a temporarily modified clustercheck (having it always return 503 so that haproxy has nothing to send connections to). We then had 0 connections to mysql, did two SSTs to the other nodes, undid the clustercheck modification, and everything was fine again.
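
      As an illustration of the drain trick described above (not necessarily the exact modification the customer made): the health check that haproxy polls can be made to answer 503 unconditionally, for instance by temporarily replacing the clustercheck script body with something like the following, so every Galera backend is marked down and the pool drains.

        #!/bin/bash
        # temporary "always unavailable" clustercheck: haproxy sees 503 on every
        # check and stops routing new connections to any Galera backend
        echo -en "HTTP/1.1 503 Service Unavailable\r\n"
        echo -en "Content-Type: text/plain\r\n"
        echo -en "Connection: close\r\n"
        echo -en "Content-Length: 36\r\n\r\n"
        echo -en "Galera cluster node is not synced.\r\n"

      Restoring the original script brings back the normal state-based 200/503 behaviour.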

      The question is, why did we have to drop all connections to mysql? When an SST happens, the node goes into DONOR state - haproxy then already shuts down all sessions. It does the SST/IST, goes back to SYNCED, and takes connections again. So technically it's the same, but in reality there is something different. To me it looks like either at the start of the SST (entering DONOR) or at the end (entering SYNCED), transactions are happening that are "missed" in the slave replication and/or in the frozen dataset. And this should never happen, regardless of the number of connections open against mysql at those points in time (~5000 tcp ESTABLISHED).
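
      For reference (again an illustration, not from the case), the DONOR/SYNCED transition described above can be watched on the donor while the transfer runs:

        $ watch -n 1 "mysql -e \"SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';\""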

      If this is a limitation of the rsync method - why aren't you using mariabackup instead?

      If there is a race condition - how do we add some pre- and post-delays?

      Or is this a bug in the current mariadb/galera code that has already been fixed?
      [/customer's concerns about the issues]

              Damien Ciabrini (rhn-engineering-dciabrin)
              RH Bugzilla Integration (jira-bugzilla-migration)
              Daniel Barzilay (Inactive)
              rhos-dfg-pidone
