  1. OpenShift Bugs
  2. OCPBUGS-77370

[METALLB][BUG] Duplicated GARPs sent during VIPs move to other nodes

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • 4.16.z, 4.18.z, 4.20.z
    • Networking / Metal LB

      Description of problem:

      
      Whenever a node is cordoned, the VIP is reassigned to another node in the pool and the new IP-to-MAC binding starts being announced. No other action is needed, such as restarting the speaker pod or rebooting the node. However, this new assignment is never permanent: it looks like the speakers keep some sort of "note" about the previous node assignment, and once that node is back in the cluster after being uncordoned, the VIP moves back to it immediately. During my tests I saw this every single time I cordoned and uncordoned the node, no matter how long it stayed cordoned.
      During this period of cordoning and uncordoning, I saw many cases of two MAC addresses trying to claim the same IP.
      
      MetalLB now tracks this using servicel2statuses.metallb.io, which is updated when these changes happen. However, even this CR is inconsistent: in some cases the old one is not deleted and stays in etcd with an empty status, while another is created with the new assignment.
      When the VIP returns to its previous node, the empty CR gets updated and the newer one is deleted.
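
      The CR inconsistency described above could be checked with something like the following. This is only a minimal sketch: it assumes the CR list has been exported as JSON (for example with `oc get servicel2statuses.metallb.io -A -o json`) and that each object exposes `status.node`, `status.serviceName` and `status.serviceNamespace`. Those field names are an assumption and should be verified against the CRD on the cluster. The function name `audit_l2_statuses` is hypothetical.

```python
from collections import defaultdict


def audit_l2_statuses(items):
    """Split ServiceL2Status objects into empty ones and per-service groups.

    `items` is the "items" list from the JSON form of the CR list.
    Field names under "status" are assumptions; verify them on your cluster.

    Returns (empty, duplicated):
      empty      -> names of CRs whose status has no node set
      duplicated -> {(namespace, service): [(cr_name, node), ...]} for
                    services that have more than one populated CR
    """
    empty = []
    by_service = defaultdict(list)
    for item in items:
        name = item.get("metadata", {}).get("name", "<unnamed>")
        status = item.get("status") or {}
        node = status.get("node")
        if not node:
            # This is the "stays in etcd with empty status" case.
            empty.append(name)
            continue
        key = (status.get("serviceNamespace"), status.get("serviceName"))
        by_service[key].append((name, node))
    duplicated = {k: v for k, v in by_service.items() if len(v) > 1}
    return empty, duplicated
```

      Running this before cordoning, while the node is cordoned, and after uncordoning would show whether the empty CR and the newer CR coexist for the whole time the VIP is away from its original node.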
      
      However, this doesn't seem to be the cause of the duplicate GARPs; it is probably a separate bug, so let me know if a new one should be opened or if this is expected for some services.
      
      Looking at the code and the logs from the speakers, I don't see anything particularly wrong, so the only thing that comes to mind is a timing or synchronization issue between the speakers. I'm not sure if this is caused by MetalLB "wanting" the VIP to move back to the previous node, so that in the meantime two speakers send GARPs at the same time. This would fit with the fact that the duplicate GARPs stop once the VIP has moved back, or when the node is completely broken and no longer joins the cluster.
      
          

      Version-Release number of selected component (if applicable):

      OCP 4.16+
          

      How reproducible:

      every time
          

      Steps to Reproduce:

          1. Configure an IPAddressPool and an L2Advertisement.
          2. Configure a few applications to be accessed via LoadBalancer services with MetalLB.
          3. Run tcpdump to monitor ARP traffic for these IPs.
          4. Force the VIPs to be allocated to another node by cordoning a node. Wait a few seconds and uncordon the node. Repeat for other nodes where other VIPs may be allocated.
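
      Steps 3 and 4 can be scripted roughly as follows. This is a sketch under assumptions, not the exact procedure used in the report: the interface name, the VIPs, the node name and the wait time are placeholders, and the helper names are hypothetical. It only relies on `tcpdump -e -n -i <interface> <filter>` and on `oc adm cordon` / `oc adm uncordon`.

```python
import subprocess
import time


def arp_capture_command(interface, vips):
    """Build a tcpdump argv that prints ARP frames for the given VIPs."""
    if not vips:
        raise ValueError("at least one VIP is required")
    hosts = " or ".join(f"host {ip}" for ip in vips)
    # -e prints the link-level header (so the source MAC is visible),
    # -n disables name resolution.
    return ["tcpdump", "-e", "-n", "-i", interface, f"arp and ({hosts})"]


def cordon_cycle_commands(node):
    """Return the cordon and uncordon argv lists for one node."""
    return [["oc", "adm", "cordon", node], ["oc", "adm", "uncordon", node]]


def run_cordon_cycle(node, wait_seconds=10, runner=subprocess.run, sleep=time.sleep):
    """Cordon a node, wait, then uncordon it.

    `runner` and `sleep` are injectable so the sequence can be dry-run.
    """
    cordon, uncordon = cordon_cycle_commands(node)
    runner(cordon, check=True)
    sleep(wait_seconds)
    runner(uncordon, check=True)
```

      The capture command would be started first on a host that can see the L2 segment, and `run_cordon_cycle` would then be called once per node currently holding a VIP.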
          

      Actual results:

          Multiple GARPs are sent by both nodes, and tcpdump warns that they might be duplicates. Under normal circumstances this is not an issue; however, for the many customers whose solutions don't allow such behavior, this causes the traffic to be blocked.
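
      To quantify the duplicates rather than rely on the capture warning, the observed announcements can be checked for IPs claimed by more than one MAC within a short window. The sketch below assumes the capture has already been reduced to `(timestamp_seconds, ip, mac)` tuples; that input format, the function name and the 5-second default window are illustrative assumptions, not something taken from MetalLB.

```python
from collections import defaultdict


def find_conflicting_claims(observations, window_seconds=5.0):
    """Return IPs announced by more than one MAC within `window_seconds`.

    `observations` is an iterable of (timestamp_seconds, ip, mac) tuples.
    The result maps each conflicting IP to the sorted list of MACs that
    announced it within the window of one another.
    """
    by_ip = defaultdict(list)
    for ts, ip, mac in observations:
        by_ip[ip].append((ts, mac.lower()))

    conflicts = {}
    for ip, events in by_ip.items():
        events.sort()
        last_seen = {}  # MAC -> timestamp of its most recent announcement
        macs = set()
        for ts, mac in events:
            # Flag any other MAC that announced this IP recently enough.
            for other, other_ts in last_seen.items():
                if other != mac and ts - other_ts <= window_seconds:
                    macs.update((mac, other))
            last_seen[mac] = ts
        if macs:
            conflicts[ip] = sorted(macs)
    return conflicts
```

      Note that a single clean handover can also be flagged if the old and new nodes announce close together; an interleaved sequence (old MAC, new MAC, old MAC again) for the same IP is the stronger sign of two speakers announcing at once.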
          

      Expected results:

      
          

      Additional info:

      
          

              fpaoline@redhat.com Federico Paolinelli
              rhn-support-andcosta Andre Costa
              Arti Sood Arti Sood