Bug
Resolution: Done-Errata
Priority: Major
Severity: Important
Target version: rhos-18.0.8
Fixed in build: mariadb-operator-container-1.0.13-2
Team: rhos-ops-platform-services-pidone
Sprint: Sprint 3, Sprint 4
This is a follow-up to OSPRH-17604, observed during QA testing in our lab.
The operator script that implements service endpoint failover contains internal logic to probe the up-to-date state of the gcomm cluster. This probing happens when the script starts, or when a command fails and is retried.
The list of members is extracted from a MySQL table which is not guaranteed to be up to date when, for example, a node disappears from the cluster due to a network partition. Consequently, the failover heuristic can sometimes choose a node that is no longer part of the Galera cluster, which leads to a long service outage.
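The staleness can be seen directly on a running galera pod by comparing the persisted membership table with the live wsrep status. The sketch below is illustrative only: the container name, the root password variable, and the use of mysql.wsrep_cluster_members as the table in question are assumptions based on the description above, not confirmed details of the operator script.
```
# Persisted view: may still list a node that has already left the cluster
$ oc exec openstack-galera-0 -c galera -- \
    mysql -uroot -p"$DB_ROOT_PASSWORD" -e \
    "SELECT node_name, node_incoming_address FROM mysql.wsrep_cluster_members;"

# Live view of the current gcomm cluster as seen by this node
$ oc exec openstack-galera-0 -c galera -- \
    mysql -uroot -p"$DB_ROOT_PASSWORD" -e \
    "SHOW GLOBAL STATUS WHERE Variable_name IN
     ('wsrep_cluster_size','wsrep_cluster_status','wsrep_incoming_addresses');"
```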
To Reproduce
Steps to reproduce the behavior:
- Deploy a RHOSO control plane on OCP
- Look for the current database endpoint
```
$ oc get svc openstack -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
openstack ClusterIP 172.30.83.191 <none> 3306/TCP 3d5h app=galera,cr=galera-openstack,statefulset.kubernetes.io/pod-name=openstack-galera-1
```
- Look for the node hosting the endpoint pod
```
$ oc get pod -l galera/name=openstack -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack-galera-0 1/1 Running 0 18h 192.168.20.105 master-2 <none> <none>
openstack-galera-1 1/1 Running 0 18h 192.168.16.128 master-1 <none> <none>
openstack-galera-2 1/1 Running 1 18h 192.168.24.26 master-0 <none> <none>
```
- Crash the node hosting the endpoint pod (master-1 in this example) from the hypervisor, then observe the service selector as sketched after these steps
```
virsh destroy cifmw-ocp-master-1
```
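After the crash, the time to fail over can be observed by watching the selector on the openstack service. A simple polling loop (illustrative only; the service and pod names are taken from the outputs above) is:
```
# Current selector; right after the crash it still points at openstack-galera-1
$ oc get svc openstack -o jsonpath='{.spec.selector}{"\n"}'

# Poll until the selector stops referencing the pod that ran on the crashed node
$ while oc get svc openstack -o jsonpath='{.spec.selector}' | grep -q openstack-galera-1; do date; sleep 5; done
```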
Expected behavior
- The selector in the service object should be updated to point at a surviving pod fairly quickly
Bug impact
- Without a fix, recovery of database traffic can take a long time.
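The outage can be measured from the data path as well, by timing how long connections to the database service keep failing. A rough sketch, assuming a client pod with a mysql client, the default openstack namespace, and a placeholder root password variable:
```
# Probe the service endpoint until it answers again and report the elapsed time
$ time (until mysql -h openstack.openstack.svc -uroot -p"$DB_ROOT_PASSWORD" \
    -e 'SELECT 1' >/dev/null 2>&1; do sleep 2; done)
```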
Known workaround
- No automatic workaround.
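As a purely manual intervention (not an official workaround), the service selector can in principle be patched to a surviving pod, reusing the labels shown in the service output above; the operator may later reconcile and overwrite this change:
```
# Hypothetical manual step: point the service at surviving pod openstack-galera-0
$ oc patch svc openstack --type merge -p \
    '{"spec":{"selector":{"app":"galera","cr":"galera-openstack","statefulset.kubernetes.io/pod-name":"openstack-galera-0"}}}'
```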
Issue links
- split from: OSPRH-17604 "galera endpoint failover can take a long time during OCP worker outage" (Closed)
- links to: RHBA-2025:153488 "Control plane Operators for RHOSO 18.0.11"