- Bug
- Resolution: Unresolved
- Major
- rhos-connectivity-neutron
- Moderate
To Reproduce
Steps to reproduce the behavior:
A customer was running a failure test, and we noticed that an OVN DB crash on ovsdbserver-Xb-0 causes a split brain in the OVN DB raft cluster.
- Deploy a RHOSO 18.0 cluster
- Trigger an OVN DB failure in either of the following two ways:
- Delete ovsdb file from ovsdbserver-Xb-0 pod and recreate the pod
$ oc rsh ovsdbserver-Xb-0 rm /etc/ovn/ovnnb_db.db
$ oc delete pod ovsdbserver-Xb-0
- Recreate the PV/PVC/pod of ovsdbserver-Xb-0
$ oc delete pv <pv_of_ovsdbserver-Xb-0>
$ oc delete pvc <pvc_of_ovsdbserver-Xb-0>
$ oc delete pod ovsdbserver-Xb-0
- Split brain occurs.
[root@util ~]# oc rsh -n openstack ovsdbserver-nb-0 ovs-appctl -t /tmp/ovnnb_db.ctl cluster/status OVN_Northbound | grep -e "Cluster ID:" -e "Status:" -e "Role:"
Cluster ID: bbec (bbec14b2-4fe0-4e28-979b-6555a8642cf3)
Status: cluster member
Role: leader
===> This shows a different Cluster ID, which means a split brain
[root@util ~]# oc rsh -n openstack ovsdbserver-nb-1 ovs-appctl -t /tmp/ovnnb_db.ctl cluster/status OVN_Northbound | grep -e "Cluster ID:" -e "Status:" -e "Role:"
Cluster ID: 0531 (05312969-86e8-4cb7-9e10-c13e46065566)
Status: cluster member
Role: follower
[root@util ~]# oc rsh -n openstack ovsdbserver-nb-2 ovs-appctl -t /tmp/ovnnb_db.ctl cluster/status OVN_Northbound | grep -e "Cluster ID:" -e "Status:" -e "Role:"
Cluster ID: 0531 (05312969-86e8-4cb7-9e10-c13e46065566)
Status: cluster member
Role: leader
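The Cluster ID comparison above can be checked mechanically. A minimal sketch (the helper names `extract_cluster_id` and `check_split_brain` are ours, not part of any Red Hat or OVN tooling; sample IDs are taken from the transcript above):

```shell
#!/bin/sh
# extract_cluster_id: read `cluster/status` output on stdin and print
# the short Cluster ID (the 4-character prefix before the parenthesis).
extract_cluster_id() {
    awk '/^Cluster ID:/ {print $3}'
}

# check_split_brain ID...: succeed (exit 0) if all member IDs are
# identical, fail (exit 1) if any two disagree, i.e. a split brain.
check_split_brain() {
    first=$1
    for id in "$@"; do
        [ "$id" = "$first" ] || return 1
    done
    return 0
}

# In a live cluster each ID would come from something like:
#   oc rsh -n openstack ovsdbserver-nb-N ovs-appctl \
#       -t /tmp/ovnnb_db.ctl cluster/status OVN_Northbound
id0=$(printf 'Cluster ID: bbec (bbec14b2-4fe0-4e28-979b-6555a8642cf3)\n' | extract_cluster_id)
id1=$(printf 'Cluster ID: 0531 (05312969-86e8-4cb7-9e10-c13e46065566)\n' | extract_cluster_id)
check_split_brain "$id0" "$id1" || echo "split brain detected"   # prints: split brain detected
```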
- This split brain is not resolved until the OVN DB cluster is rebuilt manually according to https://access.redhat.com/solutions/7135559
This issue doesn't occur on ovsdbserver-Xb-1 or ovsdbserver-Xb-2; it occurs only when the ovsdb file of ovsdbserver-Xb-0 is destroyed.
The difference comes from the following script: ovsdbserver-Xb-0 is the bootstrap pod, and it creates a brand-new raft cluster whenever it has no ovsdb file.
That is why the split brain occurs only when the ovsdb file is deleted from ovsdbserver-Xb-0.
[root@util ~]# oc rsh ovsdbserver-nb-1 cat /usr/local/bin/container-scripts/..2025_07_04_03_38_34.3858988983/setup.sh
:
if [[ "$(hostname)" != "ovsdbserver-nb-0" ]]; then
    #ovsdb-tool join-cluster /etc/ovn/ovn${DB_TYPE}_db.db ${DB_NAME} tcp:$(hostname).ovsdbserver-nb.${NAMESPACE}.svc.cluster.local:${RAFT_PORT} tcp:ovsdbserver-nb-0.ovsdbserver-nb.${NAMESPACE}.svc.cluster.local:${RAFT_PORT}
    OPTS="--db-${DB_TYPE}-cluster-remote-addr=ovsdbserver-nb-0.ovsdbserver-nb.${NAMESPACE}.svc.cluster.local --db-${DB_TYPE}-cluster-remote-port=${RAFT_PORT}"
fi
:
[root@util ~]# oc get pod ovsdbserver-nb-0 -o yaml
  volumeMounts:
  - mountPath: /usr/local/bin/container-scripts
    name: scripts
    readOnly: true
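The bootstrap decision in setup.sh reduces to a single hostname check. A sketch of just that logic (the function name `raft_role_for` is ours; the real script sets `OPTS` for ovsdb-server rather than printing a label):

```shell
#!/bin/sh
# raft_role_for HOST: mimic the decision in setup.sh. Pod -0 is given
# no --db-*-cluster-remote-addr option, so ovsdb-server bootstraps a
# brand-new cluster whenever its database file is missing; every other
# pod is told to join the cluster via pod -0.
raft_role_for() {
    if [ "$1" != "ovsdbserver-nb-0" ]; then
        echo "join-via-ovsdbserver-nb-0"
    else
        echo "bootstrap-new-cluster"
    fi
}

raft_role_for ovsdbserver-nb-0   # prints: bootstrap-new-cluster
raft_role_for ovsdbserver-nb-1   # prints: join-via-ovsdbserver-nb-0
```

This asymmetry is exactly why losing the ovsdb file on pod -0 produces a second cluster, while losing it on pod -1 or -2 only triggers a rejoin.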
Expected behavior
- Even if the ovsdb file is broken on ovsdbserver-Xb-0, the raft cluster recovers automatically and no split brain occurs.
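One hypothetical shape for that recovery: before pod -0 bootstraps a fresh cluster, it could first probe whether a peer already belongs to one and join it instead. This is purely a sketch of the decision, not a proposed patch; `choose_bootstrap_action` is our name, and the peer probe (e.g. querying `cluster/status` on ovsdbserver-nb-1/-2) is injected as an argument so the logic can run without a live cluster:

```shell
#!/bin/sh
# choose_bootstrap_action PEER_HAS_CLUSTER: hypothetical defensive
# bootstrap for pod -0. "$1" is "yes" or "no", standing in for a real
# probe of the other raft members.
choose_bootstrap_action() {
    if [ "$1" = "yes" ]; then
        echo "join-existing-cluster"   # would run: ovsdb-tool join-cluster ...
    else
        echo "create-new-cluster"      # would run: ovsdb-tool create-cluster ...
    fi
}

choose_bootstrap_action yes   # prints: join-existing-cluster
choose_bootstrap_action no    # prints: create-new-cluster
```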
Bug impact
- When customers hit a failure that breaks the ovsdb file, the OVN DB ends up in a split brain that is not recovered automatically.
Known workaround
- The split brain is resolved by manually rebuilding the OVN DB cluster according to https://access.redhat.com/solutions/7135559