OSPRH-25341 (Red Hat OpenStack Services on OpenShift)

OVN DB crash on ovsdbserver-Xb-0 causes a split brain

    • Bug
    • Resolution: Unresolved
    • Major
    • ovn-operator
    • Moderate

      To Reproduce

      During a customer's failure test, we observed that an OVN DB crash on ovsdbserver-Xb-0 causes a split brain in the OVN DB Raft cluster.

      Steps to reproduce the behavior:

      1. Deploy a RHOSO 18.0 cluster
      2. Trigger an OVN DB failure in either of the following two ways:
        • Delete the ovsdb file from the ovsdbserver-Xb-0 pod and recreate the pod
          $ oc rsh ovsdbserver-Xb-0 rm /etc/ovn/ovnnb_db.db
          $ oc delete pod ovsdbserver-Xb-0
        • Recreate the PV/PVC/pod of ovsdbserver-Xb-0
          $ oc delete pv <pv_of_ovsdbserver-Xb-0>
          $ oc delete pvc <pvc_of_ovsdbserver-Xb-0>
          $ oc delete pod ovsdbserver-Xb-0 
      3. Split brain occurs.
        [root@util ~]#  oc rsh -n openstack ovsdbserver-nb-0 ovs-appctl -t /tmp/ovnnb_db.ctl cluster/status OVN_Northbound|grep -e "Cluster ID:" -e "Status:" -e "Role:"
        Cluster ID: bbec (bbec14b2-4fe0-4e28-979b-6555a8642cf3)
        Status: cluster member
        Role: leader
        
           ===> ovsdbserver-nb-0 reports a different Cluster ID from ovsdbserver-nb-1 and ovsdbserver-nb-2, which indicates a split brain
        
        [root@util ~]# oc rsh -n openstack ovsdbserver-nb-1 ovs-appctl -t /tmp/ovnnb_db.ctl cluster/status OVN_Northbound|grep -e "Cluster ID:" -e "Status:" -e "Role:"
        Cluster ID: 0531 (05312969-86e8-4cb7-9e10-c13e46065566)
        Status: cluster member
        Role: follower
        
        [root@util ~]#  oc rsh -n openstack ovsdbserver-nb-2 ovs-appctl -t /tmp/ovnnb_db.ctl cluster/status OVN_Northbound|grep -e "Cluster ID:" -e "Status:" -e "Role:"
        Cluster ID: 0531 (05312969-86e8-4cb7-9e10-c13e46065566)
        Status: cluster member
        Role: leader 
      4. The split brain persists until the OVN DB cluster is rebuilt manually as described in https://access.redhat.com/solutions/7135559
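
The manual comparison in step 3 can be scripted. The following is a minimal sketch, not part of the operator: the function names extract_cluster_id, count_distinct_ids and check_split_brain are hypothetical, and the pod names, control socket path and default namespace are copied from the commands above. It assumes the "Cluster ID: <short> (<uuid>)" output format shown in step 3.

```shell
# Sketch only: helper names are hypothetical, not part of ovn-operator.

# Print the full cluster UUID from `cluster/status` output read on stdin.
# Expects a line such as "Cluster ID: bbec (bbec14b2-...)".
extract_cluster_id() {
  sed -n 's/^[[:space:]]*Cluster ID:[^(]*(\([^)]*\)).*$/\1/p' | head -n 1
}

# Print the number of distinct, non-empty IDs passed as arguments.
count_distinct_ids() {
  printf '%s\n' "$@" | sed '/^$/d' | sort -u | wc -l | tr -d '[:space:]'
}

# Query the three NB pods and report whether they agree on one Cluster ID.
# Pods that cannot be queried yield no ID and are ignored by the comparison.
check_split_brain() {
  ns="${1:-openstack}"
  ids=""
  for pod in ovsdbserver-nb-0 ovsdbserver-nb-1 ovsdbserver-nb-2; do
    id="$(oc rsh -n "$ns" "$pod" ovs-appctl -t /tmp/ovnnb_db.ctl \
          cluster/status OVN_Northbound | extract_cluster_id)"
    ids="$ids $id"
  done
  # Word splitting is intentional: UUIDs contain no whitespace.
  if [ "$(count_distinct_ids $ids)" -gt 1 ]; then
    echo "split brain: multiple Cluster IDs seen:$ids"
    return 1
  fi
  echo "ok: pods agree on Cluster ID:$ids"
}
```

Running check_split_brain openstack from a host with oc access performs the check; a non-zero return status means more than one Cluster ID was seen.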

      This issue doesn't occur on ovsdbserver-Xb-1 or ovsdbserver-Xb-2.
      It occurs only when the ovsdb file of ovsdbserver-Xb-0 is destroyed.

      The difference comes from the following script.
      ovsdbserver-Xb-0 is the bootstrap pod, and it creates a new Raft cluster whenever it has no ovsdb file.
      That's why the split brain occurs only when the ovsdb file is deleted from ovsdbserver-Xb-0.

       

      [root@util ~]#  oc rsh ovsdbserver-nb-1 cat /usr/local/bin/container-scripts/..2025_07_04_03_38_34.3858988983/setup.sh
         :
      if [[ "$(hostname)" != "ovsdbserver-nb-0" ]]; then
          #ovsdb-tool join-cluster /etc/ovn/ovn${DB_TYPE}_db.db ${DB_NAME} tcp:$(hostname).ovsdbserver-nb.${NAMESPACE}.svc.cluster.local:${RAFT_PORT} tcp:ovsdbserver-nb-0.ovsdbserver-nb.${NAMESPACE}.svc.cluster.local:${RAFT_PORT}
          OPTS="--db-${DB_TYPE}-cluster-remote-addr=ovsdbserver-nb-0.ovsdbserver-nb.${NAMESPACE}.svc.cluster.local --db-${DB_TYPE}-cluster-remote-port=${RAFT_PORT}"
      fi
         :
      
      [root@util ~]#  oc get pod  ovsdbserver-nb-0 -o yaml
            volumeMounts:
            - mountPath: /usr/local/bin/container-scripts
              name: scripts
              readOnly: true 
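
The startup behavior described above can be summarized as a small decision function. This is a simplified model for illustration, not the operator's actual setup.sh: the function name bootstrap_action and the three result labels are hypothetical, and the model assumes that a pod with an existing ovsdb file simply restarts with it.

```shell
# Simplified model of the startup decision; not the real setup.sh.
# bootstrap_action and its result labels are hypothetical.
# Arguments: <pod hostname> <path to the ovsdb file>
bootstrap_action() {
  host="$1"
  db_file="$2"
  if [ -e "$db_file" ]; then
    # The database file survived: keep the existing cluster membership.
    echo "start-with-existing-db"
    return 0
  fi
  case "$host" in
    *-0)
      # Pod 0 without a database bootstraps a brand-new Raft cluster,
      # even if pods 1 and 2 still form the old one. This is the split brain.
      echo "create-new-cluster"
      ;;
    *)
      # Other pods without a database join through pod 0.
      echo "join-via-pod-0"
      ;;
  esac
}
```

For example, bootstrap_action ovsdbserver-nb-0 /etc/ovn/ovnnb_db.db prints create-new-cluster when the file is missing, while the same call for ovsdbserver-nb-1 prints join-via-pod-0, which matches the asymmetry observed in the reproduction steps.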

       

      Expected behavior

      • Even if the ovsdb file is broken on ovsdbserver-Xb-0, the Raft cluster should recover automatically and no split brain should occur.

      Bug impact

      • When a customer hits a failure that breaks the ovsdb file on ovsdbserver-Xb-0, the OVN DB cluster ends up in a split brain that is not recovered automatically.

      Known workaround

              twilson@redhat.com Terry Wilson
              rhn-support-yatanaka Yamato Tanaka
              rhos-dfg-networking-squad-neutron