- Bug
- Resolution: Done
- Major
- rhos-18.0.8
- None
- 1
- False
- False
- ?
- mariadb-operator-container-1.0.12-2
- rhos-ops-platform-services-pidone
- None
- Sprint 2, Sprint 3
- 2
- Important
Observed during QA testing in our lab.
When we forcibly destroy an OCP worker/master VM hosting the galera pod that is currently selected as the database traffic endpoint, the other galera pods correctly detect that the pod went away, and a script is correctly executed on one of the surviving pods to update the traffic endpoint. This works as expected.
The pod configuring the new endpoint uses curl to call the API server and update the selector of the service object responsible for balancing database traffic.
If the API server VIP is hosted on the node that was forcibly destroyed, the VIP is reassigned automatically by OCP. However, if that reassignment happens while curl is running, curl can remain stuck for a very long time trying to establish a connection to the stopped node. During this time the service object cannot be updated and database traffic is unavailable, even though galera itself recovered almost instantly from the loss of a cluster member.
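One way to bound the outage is to give curl explicit timeouts and retries, so a VIP caught mid-failover cannot hold the selector update hostage. The following is a minimal sketch, not the operator's actual script: the `update_endpoint` helper, the `openstack` namespace/service names, the target pod, and the timeout values are all illustrative assumptions.

```shell
#!/bin/sh
# Illustrative sketch: patch the Service selector to a surviving pod,
# bounding how long curl may block while the API VIP fails over.
#   --connect-timeout : cap on TCP/TLS connection establishment
#   --max-time        : cap on the whole request
#   --retry           : re-attempt transient failures once the cap fires
update_endpoint() {
    pod=$1   # surviving pod to promote, e.g. openstack-galera-0
    sa=/var/run/secrets/kubernetes.io/serviceaccount
    curl --connect-timeout 5 --max-time 15 --retry 3 \
      --cacert "$sa/ca.crt" \
      -H "Authorization: Bearer $(cat "$sa/token")" \
      -H "Content-Type: application/strategic-merge-patch+json" \
      -X PATCH \
      "https://kubernetes.default.svc/api/v1/namespaces/openstack/services/openstack" \
      -d "{\"spec\":{\"selector\":{\"statefulset.kubernetes.io/pod-name\":\"$pod\"}}}"
}
```

With the caps above, a hung connection fails after at most 15 seconds per attempt and the script can retry or pick another path, instead of inheriting the kernel's multi-minute TCP retransmission timeout.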
To Reproduce
Steps to reproduce the behavior:
- Deploy a RHOSO control plane on OCP
- Look for the current database endpoint
```
$ oc get svc openstack -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
openstack ClusterIP 172.30.83.191 <none> 3306/TCP 3d5h app=galera,cr=galera-openstack,statefulset.kubernetes.io/pod-name=openstack-galera-1
```
- Look for the node hosting the endpoint pod
```
$ oc get pod -l galera/name=openstack -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack-galera-0 1/1 Running 0 18h 192.168.20.105 master-2 <none> <none>
openstack-galera-1 1/1 Running 0 18h 192.168.16.128 master-1 <none> <none>
openstack-galera-2 1/1 Running 1 18h 192.168.24.26 master-0 <none> <none>
```
- Crash the node on the hypervisor
```
$ virsh destroy cifmw-ocp-master-1
```
Expected behavior
- the selector in the service object should be updated to a surviving pod fairly quickly
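The selector can be checked after the crash with a query like the following (jsonpath dots in the label key escaped as accepted by `oc`); it should report a surviving pod within seconds:

```
$ oc get svc openstack -o jsonpath='{.spec.selector.statefulset\.kubernetes\.io/pod-name}'
```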
Bug impact
- Without a fix, recovery of database traffic can take an unacceptably long time.
Known workaround
- No automatic workaround.
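Manual recovery is possible by repointing the selector by hand once the API server is reachable again; a possible intervention (the target pod name is illustrative, pick any healthy galera pod, and keep the other selector labels as shown by `oc get svc openstack -o wide`):

```
$ oc patch svc openstack --type merge \
    -p '{"spec":{"selector":{"app":"galera","cr":"galera-openstack","statefulset.kubernetes.io/pod-name":"openstack-galera-0"}}}'
```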
- split to
  - OSPRH-18408 galera endpoint failover might rely on stale endpoints data during OCP worker outage (Closed)
- links to