OCPBUGS-77068

MAC flapping observed for EgressIP when nodes are being rebooted

    • Moderate
    • Customer Escalated

      Description of problem:

      When a node is being rebooted, ovnkube-cluster-manager detects that the health probe to that node receives a connection timeout and immediately reassigns the EgressIP to a new node. The IP is not removed from the rebooting node.

      The newly assigned node broadcasts a Gratuitous ARP (GARP) announcement to update the local network with its MAC address. The old node still responds to ARP requests for the EgressIP, because the IP has not yet been removed from it. This "dual MAC" condition remains active until the networking services on the old node become inactive or enter a "dead" state during the shutdown sequence run by systemd.

      Shutting down the network services on the old node can take a long time, depending on the size of the node and the load on it. The time can be long enough to disrupt the networking infrastructure and cause switches, firewalls, or other devices to malfunction or block the EgressIP traffic.

      Version-Release number of selected component (if applicable):

          OCP 4.18

      How reproducible:

          100%

      Steps to Reproduce:

          1. Shut down a node that has EgressIPs assigned

          2. Monitor the network traffic

          3. See ARPs / GARPs for the same EgressIP coming from multiple nodes.
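
      The check in step 3 can be sketched in Python as follows. This is a minimal illustration, not tooling from the report: it assumes the ARP traffic has already been reduced to (timestamp, sender IP, sender MAC) tuples, and the function name `count_mac_flaps` and the sample addresses are hypothetical.

```python
from collections import defaultdict


def count_mac_flaps(observations):
    """Return {ip: number of times the announcing MAC changed}.

    observations: iterable of (timestamp_seconds, sender_ip, sender_mac)
    tuples, e.g. extracted from a capture of ARP replies and GARPs.
    IPs whose MAC never changes are omitted from the result.
    """
    last_mac = {}
    flaps = defaultdict(int)
    # Process in time order; the sort is stable for equal timestamps.
    for _, ip, mac in sorted(observations, key=lambda o: o[0]):
        mac = mac.lower()
        if ip in last_mac and last_mac[ip] != mac:
            flaps[ip] += 1
        last_mac[ip] = mac
    return dict(flaps)


# Hypothetical sample data (documentation address range, made-up MACs).
sample = [
    (0.0, "192.0.2.10", "52:54:00:00:00:0a"),  # old node owns the EgressIP
    (1.0, "192.0.2.10", "52:54:00:00:00:0b"),  # GARP from the new node
    (1.5, "192.0.2.10", "52:54:00:00:00:0a"),  # old node still answers ARP
    (2.0, "192.0.2.10", "52:54:00:00:00:0b"),  # new node again
    (2.0, "192.0.2.20", "52:54:00:00:00:0c"),  # unrelated IP, stable MAC
]

print(count_mac_flaps(sample))
```

      A non-zero count for an EgressIP while a node is shutting down would be consistent with the "dual MAC" condition described above.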

      Actual results:

      Depending on the network infrastructure, switches get overloaded and ports get blocked.

      Expected results:

      The old node should remove the EgressIP as soon as possible. The time during which the IP is announced from two nodes should be kept to a minimum. Relying on the reboot itself is not sufficient, because the reboot time is not predictable.
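
      One way to quantify "the time during which the IP is announced from two nodes" is sketched below in Python. It is an assumption-laden illustration rather than an agreed metric: the function name `dual_announcement_window`, the input tuple layout, and the sample values are all hypothetical.

```python
def dual_announcement_window(observations, egress_ip, old_mac, new_mac):
    """Seconds from the first announcement by new_mac to the last by old_mac.

    observations: iterable of (timestamp_seconds, sender_ip, sender_mac).
    Returns 0.0 when either MAC was never seen for egress_ip, or when the
    old node stopped announcing before the new node started.
    """
    old_mac, new_mac = old_mac.lower(), new_mac.lower()
    old_times = [t for t, ip, mac in observations
                 if ip == egress_ip and mac.lower() == old_mac]
    new_times = [t for t, ip, mac in observations
                 if ip == egress_ip and mac.lower() == new_mac]
    if not old_times or not new_times:
        return 0.0
    return max(0.0, max(old_times) - min(new_times))


# Hypothetical capture: the new node announces at t=10.0 s, and the old
# node keeps answering ARP until t=42.5 s while it shuts down.
capture = [
    (5.0, "192.0.2.10", "52:54:00:00:00:0a"),
    (10.0, "192.0.2.10", "52:54:00:00:00:0b"),
    (20.0, "192.0.2.10", "52:54:00:00:00:0a"),
    (42.5, "192.0.2.10", "52:54:00:00:00:0a"),
    (43.0, "192.0.2.10", "52:54:00:00:00:0b"),
]

window = dual_announcement_window(
    capture, "192.0.2.10", "52:54:00:00:00:0a", "52:54:00:00:00:0b"
)
print(window)
```

      The expected behavior described above corresponds to this value staying close to zero, regardless of how long the node takes to reboot.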

      Additional info:

      EgressIP Failover Dynamics: Normal Behavior and the Impact of Graceful Shutdowns/Reboots

       

          

              rhn-support-arghosh Arnab Ghosh
              rhn-support-rhodain1 Roman Hodain
              Anurag Saxena Anurag Saxena