  1. OpenShift Bugs
  2. OCPBUGS-7718

Possible split brain with keepalived unicast

      This is a clone of issue OCPBUGS-1565. The following is the description of the original issue:

      Description of problem:

      We observed a split-brain case with keepalived in unicast mode, where two worker nodes were fighting for the ingress VIP.
      One of these nodes failed to register itself with the cluster, so it was missing from the output of the node list. That, in turn, caused it to be missing from the unicast_peer list in keepalived. This node believed it was the master because it was not receiving VRRP advertisements from the other nodes, while the other nodes were constantly re-electing a master.
      
      This behavior was observed in a QE-deployed cluster on PSI. It caused constant VIP flapping and a huge load on OVN.
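
The failure mode described above can be illustrated with a minimal Python model. This is a sketch under stated assumptions, not the actual configuration code: it assumes every node builds its unicast_peer list from the cluster node list, and that a node only receives VRRP advertisements from nodes that list it as a peer. The function names and node names are hypothetical.

```python
def build_peer_lists(running, registered):
    """Model each running node's unicast_peer list.

    Assumption: the list is the set of registered nodes minus the node itself.
    """
    return {node: sorted(set(registered) - {node}) for node in running}


def isolated_nodes(peer_lists):
    """Return nodes that no other node lists as a peer.

    In this model such a node never receives a VRRP advertisement,
    so it would consider itself the master.
    """
    reachable = {peer for peers in peer_lists.values() for peer in peers}
    return sorted(node for node in peer_lists if node not in reachable)


# Hypothetical cluster: worker-2 runs keepalived but failed to register.
running = ["worker-0", "worker-1", "worker-2"]
registered = ["worker-0", "worker-1"]

peer_lists = build_peer_lists(running, registered)
print(isolated_nodes(peer_lists))  # prints ['worker-2']
```

In this model worker-2 is absent from every other node's peer list, which matches the node that believed it was the master in the report.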
      

      Version-Release number of selected component (if applicable):

      
      

      How reproducible:

      Not sure. We don't know why the worker node failed to register with the cluster (the cluster is gone now) or what QE was testing at the time.
      

      Steps to Reproduce:

      1.
      2.
      3.
      

      Actual results:

      The cluster was unhealthy due to the constant ingress VIP failover. It was also putting a huge load on the PSI cloud.
      

      Expected results:

      A flapping VIP can be very expensive for the underlying infrastructure. Under no circumstances should OCP be able to bring the underlying infrastructure down.
      
      When keepalived runs in unicast mode, a node should not be able to claim the VIP unless it has correctly registered with the cluster and appears in the node list.
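
The expected behavior could be expressed as a simple registration check. The sketch below is only an illustration of the rule, not a proposed fix: the function name is hypothetical, and how the node list is obtained is left out. It returns a shell-style exit code, on the assumption that such a check might be wired into a keepalived health check.

```python
def registration_check(hostname, node_list):
    """Return 0 if hostname appears in the cluster node list, 1 otherwise.

    Assumption: a non-zero result would prevent the node from claiming the VIP.
    """
    return 0 if hostname in set(node_list) else 1


# Hypothetical node list in which worker-2 never registered.
node_list = ["worker-0", "worker-1"]

for host in ["worker-0", "worker-2"]:
    status = "may claim VIP" if registration_check(host, node_list) == 0 else "must not claim VIP"
    print(host, status)
```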
      

      Additional info:

      
      

            bnemec@redhat.com Benjamin Nemec
            openshift-crt-jira-prow OpenShift Prow Bot
            Zhanqi Zhao Zhanqi Zhao