Project: Red Hat OpenStack Services on OpenShift
Issue: OSPRH-18408

galera endpoint failover might rely on stale endpoints data during OCP worker outage


    • Type: Bug
    • Resolution: Done-Errata
    • Priority: Major
    • Fix versions: rhos-18.0.11, rhos-18.0.8
    • Component: mariadb-operator
    • Fixed in build: mariadb-operator-container-1.0.13-2
    • Team: rhos-ops-platform-services-pidone
    • Sprint: Sprint 3, Sprint 4
    • Severity: Important

      This is a follow-up to OSPRH-17604, observed during QA testing in our lab.

      The operator script that implements service endpoint failover contains internal logic to probe the up-to-date state of the gcomm cluster. This probe runs when the script starts, or when a command fails and is retried.

      The list of members is extracted from a mysql table that is not guaranteed to be up to date when, for example, a node disappears from the cluster due to a network partition. Consequently, the failover heuristic can sometimes choose a node that is no longer part of the galera cluster, which leads to a long service outage.
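      For context, the live gcomm membership is reported by Galera status variables such as `wsrep_incoming_addresses`. A minimal sketch of extracting member IPs from such a value (the sample string is hardcoded here, since the real query needs a running cluster):

      ```
      # Hypothetical sample value, as
      #   mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_incoming_addresses'"
      # might report it on a healthy 3-node cluster
      addresses="192.168.20.105:3306,192.168.16.128:3306,192.168.24.26:3306"

      # Split the comma-separated list and strip the :port suffix from each entry
      members=""
      for entry in ${addresses//,/ }; do
          members="$members ${entry%:*}"
      done
      members="${members# }"    # drop the leading space
      echo "current members: $members"
      ```

      Unlike a table persisted in mysql, this status variable tracks the cluster view directly, which is why stale rows and live membership can disagree after a partition.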
          

      Steps to reproduce the behavior:

      1. Deploy a RHOSO control plane on OCP
      2. Look for the current database endpoint
        ```
        $ oc get svc openstack -o wide
        NAME        TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE    SELECTOR
        openstack   ClusterIP   172.30.83.191   <none>        3306/TCP   3d5h   app=galera,cr=galera-openstack,statefulset.kubernetes.io/pod-name=openstack-galera-1
        ```
      3. Look for the node hosting the endpoint pod
        ```
        $ oc get pod -l galera/name=openstack -o wide
        NAME                 READY   STATUS    RESTARTS   AGE   IP               NODE       NOMINATED NODE   READINESS GATES
        openstack-galera-0   1/1     Running   0          18h   192.168.20.105   master-2   <none>           <none>
        openstack-galera-1   1/1     Running   0          18h   192.168.16.128   master-1   <none>           <none>
        openstack-galera-2   1/1     Running   1          18h   192.168.24.26    master-0   <none>           <none>
        ```
      4. Crash the node hosting the endpoint pod on the hypervisor
        ```
        $ virsh destroy cifmw-ocp-master-1
        ```
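      Note in the step 2 output that the ClusterIP service pins all database traffic to a single pod through the `statefulset.kubernetes.io/pod-name` label in its selector. A small sketch, using the selector string copied from that output, of extracting the pinned pod name:

      ```
      # Selector string as shown in the SELECTOR column of 'oc get svc openstack -o wide'
      selector="app=galera,cr=galera-openstack,statefulset.kubernetes.io/pod-name=openstack-galera-1"

      # Everything after 'pod-name=' is the pod currently serving the endpoint
      pod="${selector##*pod-name=}"
      echo "traffic is pinned to pod: $pod"
      ```

      Crashing master-1 therefore takes down exactly the pod the service points at, which is what forces the operator's failover logic to run.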

      Expected behavior

      • the selector in the service object should be updated to select a surviving pod shortly after the node goes down

      Bug impact

      • Without a fix, recovery of the database traffic can take a long time, because the service can keep pointing at a galera member that is no longer part of the cluster.
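      The underlying failure mode can be sketched in plain shell (no cluster required; the member lists and the picking rule are illustrative, not the operator's actual code):

      ```
      # Illustrative member lists, as the failover script might see them after
      # master-1 crashed (names follow the pod list from the reproduction steps)
      stale_members="openstack-galera-0 openstack-galera-1 openstack-galera-2"  # from the lagging mysql table
      live_members="openstack-galera-0 openstack-galera-2"                      # actual gcomm membership

      # A naive heuristic that trusts the stale list, e.g. picking the second member
      pick_endpoint() { set -- $1; echo "$2"; }

      choice=$(pick_endpoint "$stale_members")

      # Check the choice against the live membership
      status="ok"
      case " $live_members " in
        *" $choice "*) ;;
        *) status="stale-choice" ;;
      esac
      echo "chosen endpoint: $choice ($status)"
      ```

      Here the heuristic selects openstack-galera-1, the very member that just left the cluster, so the repointed service endpoint stays unreachable until the stale data is refreshed.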

      Known workaround

      • No automatic workaround.

              Assignee / Reporter: Damien Ciabrini (rhn-engineering-dciabrin)
              Group: rhos-dfg-pidone