- Bug
- Resolution: Done
- Major
- rhos-18.0.8
- None
- 1
- False
- False
- ?
- mariadb-operator-container-1.0.12-2
- rhos-ops-platform-services-pidone
- None
- Sprint 2, Sprint 3
- 2
- Important
Observed during QA testing in our lab.
When we forcibly destroy an OCP worker/master VM hosting the galera pod that is currently selected as the database traffic endpoint, the other galera pods correctly detect that the pod went away, and a script is correctly executed on one of the surviving pods to update the traffic endpoint. This works as expected.
The pod configuring the new endpoint uses curl to call the API server and update the selector of the service object responsible for balancing database traffic.
If the API server VIP is hosted on the node that was forcibly destroyed, the VIP is reassigned automatically by OCP. However, if that reassignment happens while curl is running, curl can remain stuck for a very long time trying to establish a connection to the stopped node. During this time the service object cannot be updated and database traffic is unavailable, even though galera itself recovered almost instantly from the loss of a cluster member.
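One way to bound the outage is to give curl explicit timeouts and retries, so a VIP caught mid-failover cannot hold the selector update hostage. The following is a minimal sketch, not the operator's actual script: the `update_endpoint` helper, the `openstack` namespace/service names, the target pod, and the timeout values are all illustrative assumptions.

```shell
#!/bin/sh
# Illustrative sketch: patch the Service selector to a surviving pod,
# bounding how long curl may block while the API VIP fails over.
#   --connect-timeout : cap on TCP/TLS connection establishment
#   --max-time        : cap on the whole request
#   --retry           : re-attempt transient failures once the cap fires
update_endpoint() {
    pod=$1   # surviving pod to promote, e.g. openstack-galera-0
    sa=/var/run/secrets/kubernetes.io/serviceaccount
    curl --connect-timeout 5 --max-time 15 --retry 3 \
      --cacert "$sa/ca.crt" \
      -H "Authorization: Bearer $(cat "$sa/token")" \
      -H "Content-Type: application/strategic-merge-patch+json" \
      -X PATCH \
      "https://kubernetes.default.svc/api/v1/namespaces/openstack/services/openstack" \
      -d "{\"spec\":{\"selector\":{\"statefulset.kubernetes.io/pod-name\":\"$pod\"}}}"
}
```

With the caps above, a hung connection fails after at most 15 seconds per attempt and the script can retry or pick another path, instead of inheriting the kernel's multi-minute TCP retransmission timeout.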
To Reproduce
Steps to reproduce the behavior:
- Deploy a RHOSO control plane on OCP
- Look for the current database endpoint
```
$ oc get svc openstack -o wide
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
openstack ClusterIP 172.30.83.191 <none> 3306/TCP 3d5h app=galera,cr=galera-openstack,statefulset.kubernetes.io/pod-name=openstack-galera-1
```
- Look for the node hosting the endpoint pod
```
$ oc get pod -l galera/name=openstack -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
openstack-galera-0 1/1 Running 0 18h 192.168.20.105 master-2 <none> <none>
openstack-galera-1 1/1 Running 0 18h 192.168.16.128 master-1 <none> <none>
openstack-galera-2 1/1 Running 1 18h 192.168.24.26 master-0 <none> <none>
```
- Crash the node on the hypervisor
```
$ virsh destroy cifmw-ocp-master-1
```
Expected behavior
- the selector in the service object should be updated to a surviving pod fairly quickly
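The selector can be checked after the crash with a query like the following (jsonpath dots in the label key escaped as accepted by `oc`); it should report a surviving pod within seconds:

```
$ oc get svc openstack -o jsonpath='{.spec.selector.statefulset\.kubernetes\.io/pod-name}'
```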
Bug impact
- Without a fix, recovery of database traffic can take an unacceptably long time.
Known workaround
- No automatic workaround.
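Manual recovery is possible by repointing the selector by hand once the API server is reachable again; a possible intervention (the target pod name is illustrative, pick any healthy galera pod, and keep the other selector labels as shown by `oc get svc openstack -o wide`):

```
$ oc patch svc openstack --type merge \
    -p '{"spec":{"selector":{"app":"galera","cr":"galera-openstack","statefulset.kubernetes.io/pod-name":"openstack-galera-0"}}}'
```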
- split to
  - OSPRH-18408 galera endpoint failover might rely on stale endpoints data during OCP worker outage (Closed)
- links to