Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: None
Component/s: ovn-operator
Labels:
None

Story Points:
0
Epic Link:
[BugEpic]: OVN DB crash on ovsdbserver-Xb-0 causes a split brain
Blocked:
False
Blocked Reason:

None
Ready:
False
Docs Approval:
?
AssignedTeam:
rhos-connectivity-neutron-quark
Regression:
None
Intelligence Requested:
Market:
PX Impact Score:

Severity:
Moderate

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

To Reproduce

Steps to reproduce the behavior:

A customer was running a failure test, and we noticed that an OVN DB crash on ovsdbserver-Xb-0 causes a split brain in the OVN DB Raft cluster.

Deploy a RHOSO 18.0 cluster

Cause an OVN DB failure in either of the following two ways:

Delete the ovsdb file from the ovsdbserver-Xb-0 pod and recreate the pod:

$ oc rsh ovsdbserver-Xb-0 rm /etc/ovn/ovnnb_db.db
$ oc delete pod ovsdbserver-Xb-0

Recreate the PV, PVC, and pod of ovsdbserver-Xb-0:

$ oc delete pv <pv_of_ovsdbserver-Xb-0>
$ oc delete pvc <pvc_of_ovsdbserver-Xb-0>
$ oc delete pod ovsdbserver-Xb-0

Split brain occurs.

[root@util ~]#  oc rsh -n openstack ovsdbserver-nb-0 ovs-appctl -t /tmp/ovnnb_db.ctl cluster/status OVN_Northbound|grep -e "Cluster ID:" -e "Status:" -e "Role:"
Cluster ID: bbec (bbec14b2-4fe0-4e28-979b-6555a8642cf3)
Status: cluster member
Role: leader

   ===> ovsdbserver-nb-0 reports a different Cluster ID from ovsdbserver-nb-1 and ovsdbserver-nb-2, which indicates a split brain

[root@util ~]# oc rsh -n openstack ovsdbserver-nb-1 ovs-appctl -t /tmp/ovnnb_db.ctl cluster/status OVN_Northbound|grep -e "Cluster ID:" -e "Status:" -e "Role:"
Cluster ID: 0531 (05312969-86e8-4cb7-9e10-c13e46065566)
Status: cluster member
Role: follower

[root@util ~]#  oc rsh -n openstack ovsdbserver-nb-2 ovs-appctl -t /tmp/ovnnb_db.ctl cluster/status OVN_Northbound|grep -e "Cluster ID:" -e "Status:" -e "Role:"
Cluster ID: 0531 (05312969-86e8-4cb7-9e10-c13e46065566)
Status: cluster member
Role: leader

This split brain is not resolved until the OVN DB cluster is rebuilt manually according to https://access.redhat.com/solutions/7135559
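
The manual comparison of Cluster IDs shown above could be scripted roughly as follows. This is a minimal sketch, not a supported tool: the function names (`distinct_cluster_ids`, `check_nb_split_brain`) are hypothetical, and it assumes three Northbound pods named ovsdbserver-nb-0 through ovsdbserver-nb-2 and the control socket path /tmp/ovnnb_db.ctl, as in the transcripts in this report.

```shell
#!/usr/bin/env bash

# Read one or more "cluster/status" outputs on stdin and print how many
# distinct cluster UUIDs they contain. Carriage returns are stripped first
# because "oc rsh" may allocate a TTY.
distinct_cluster_ids() {
    tr -d '\r' \
        | sed -n 's/^Cluster ID: [^ ]* (\(.*\))$/\1/p' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}

# Hypothetical helper: query the three NB pods and compare their cluster IDs.
# $1 = namespace (assumed to be "openstack" if omitted)
check_nb_split_brain() {
    local ns="${1:-openstack}"
    local count
    count=$(for i in 0 1 2; do
        oc rsh -n "$ns" "ovsdbserver-nb-${i}" \
            ovs-appctl -t /tmp/ovnnb_db.ctl cluster/status OVN_Northbound
    done | distinct_cluster_ids)

    if [ "$count" -eq 0 ]; then
        echo "unknown: no Cluster ID could be read"
        return 2
    elif [ "$count" -gt 1 ]; then
        echo "split brain suspected: $count distinct Cluster IDs"
        return 1
    fi
    echo "ok: all reachable members report the same Cluster ID"
}
```

Calling `check_nb_split_brain openstack` from a host with `oc` access would run the check. A pod that cannot be reached contributes no Cluster ID, so an "ok" result only covers the members that answered.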

This issue does not occur when the ovsdb file of ovsdbserver-Xb-1 or ovsdbserver-Xb-2 is destroyed.
It occurs only when the ovsdb file of ovsdbserver-Xb-0 is destroyed.

The difference comes from the following script.
ovsdbserver-Xb-0 is the bootstrap pod: unlike the other pods, it is not given a cluster remote address, so it creates a new Raft cluster when it doesn't have the ovsdb file.
That's why the split brain occurs only when the ovsdb file is deleted from ovsdbserver-Xb-0.

[root@util ~]#  oc rsh ovsdbserver-nb-1 cat /usr/local/bin/container-scripts/..2025_07_04_03_38_34.3858988983/setup.sh
   :
if [[ "$(hostname)" != "ovsdbserver-nb-0" ]]; then
    #ovsdb-tool join-cluster /etc/ovn/ovn${DB_TYPE}_db.db ${DB_NAME} tcp:$(hostname).ovsdbserver-nb.${NAMESPACE}.svc.cluster.local:${RAFT_PORT} tcp:ovsdbserver-nb-0.ovsdbserver-nb.${NAMESPACE}.svc.cluster.local:${RAFT_PORT}
    OPTS="--db-${DB_TYPE}-cluster-remote-addr=ovsdbserver-nb-0.ovsdbserver-nb.${NAMESPACE}.svc.cluster.local --db-${DB_TYPE}-cluster-remote-port=${RAFT_PORT}"
fi
   :

[root@util ~]#  oc get pod  ovsdbserver-nb-0 -o yaml
      volumeMounts:
      - mountPath: /usr/local/bin/container-scripts
        name: scripts
        readOnly: true
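
To make the asymmetry concrete, the start-up decision described above can be modeled as a small shell function. This is only an illustrative model based on the excerpt and the observed behavior, not the operator's actual code: `startup_action` and its return strings are hypothetical, and the "reuse" branch assumes that a pod with an existing database file keeps using it.

```shell
#!/usr/bin/env bash

# Illustrative model of the start-up decision (hypothetical function).
# $1 = pod hostname
# $2 = "yes" if the ovsdb file already exists, anything else otherwise
startup_action() {
    local host="$1" db_exists="$2"

    if [ "$db_exists" = "yes" ]; then
        # Assumption: an existing database file is reused as-is.
        echo "reuse-existing-db"
    elif [ "$host" != "ovsdbserver-nb-0" ]; then
        # Non-bootstrap pods get the cluster remote address of pod 0
        # and therefore join the existing cluster.
        echo "join-via-ovsdbserver-nb-0"
    else
        # Pod 0 has no remote address to join, so it starts a new
        # cluster with a new Cluster ID: the split brain in this report.
        echo "bootstrap-new-cluster"
    fi
}
```

In this model, `startup_action ovsdbserver-nb-0 no` is the only input that produces a new cluster, which matches the observation that only losing the file on pod 0 triggers the split brain.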

Expected behavior

Even if the ovsdb file on ovsdbserver-Xb-0 is broken, the Raft cluster should recover automatically and no split brain should occur.

Bug impact

When customers hit a failure that breaks the ovsdb file on ovsdbserver-Xb-0, the OVN DB cluster ends up in a split brain and does not recover automatically.

Known workaround

The split brain is resolved by manually rebuilding the OVN DB cluster according to https://access.redhat.com/solutions/7135559

Assignee: Terry Wilson

Reporter: Yamato Tanaka

Team: rhos-dfg-networking-squad-neutron

Votes: 0

Watchers: 5

Created: 2026/01/15 12:45 AM

Updated: 2026/03/10 1:39 PM
