Type: Bug
Resolution: Unresolved
Priority: Normal
Fix Version/s: None
Affects Version/s: 4.18
Component/s: Networking / ovn-kubernetes
Labels:
- SDN:OVNK:EgressIP

Activity Type:
None
Blocked:
False
Blocked Reason:

None
Story Points:
None
Severity:
Moderate
Regression:
None

Target Backport Versions:
None
Target Version:
None
Release Blocker:
None
Sprint:
None

Customer Impact:

Customer Escalated

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Impact Score:

Release Note Status:
None
Release Note Type:
None
Release Note Text:
None

Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

When a node is being rebooted, the ovnkube-cluster-manager detects that the health probe receives a connection timeout. It immediately reassigns the EgressIP to a new node. The IP is not removed from the rebooting node.

The newly assigned node broadcasts a Gratuitous ARP (GARP) announcement to update the local network with its MAC address. The old node also keeps answering ARP requests for the EgressIP, because the IP has not yet been removed from it. This "dual MAC" condition persists until the networking services on the old node become inactive or enter a "dead" state during the systemd shutdown sequence.

Shutting down the network services on the old node can take a long time, depending on the size of the node and the load on it. The delay can be long enough to disrupt the networking infrastructure and cause switches, firewalls, or other devices to malfunction or block the EgressIP traffic.

Version-Release number of selected component (if applicable):

OCP 4.18

How reproducible:

100%

Steps to Reproduce:

1. Shut down a node that has EgressIPs assigned

2. Monitor the network traffic

3. See ARPs / GARPs for the same EgressIP coming from multiple nodes.
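
Step 3 can be checked mechanically once the ARP traffic has been captured. The following is a minimal sketch, not part of ovn-kubernetes: it assumes the capture has already been parsed into (timestamp, IP, MAC) tuples by whatever tool is in use, and the function name `find_conflicting_ips` and the sample addresses are hypothetical.

```python
from collections import defaultdict


def find_conflicting_ips(observations):
    """Return {ip: sorted MACs} for every IP claimed by more than one MAC.

    observations: iterable of (timestamp_seconds, ip, mac) tuples taken from
    ARP replies and gratuitous ARPs (the parsing step is assumed, not shown).
    """
    macs_by_ip = defaultdict(set)
    for _timestamp, ip, mac in observations:
        # Normalise case so "AA:..." and "aa:..." count as the same MAC.
        macs_by_ip[ip].add(mac.lower())
    return {ip: sorted(macs) for ip, macs in macs_by_ip.items() if len(macs) > 1}


# Hypothetical sample: 192.0.2.10 is the EgressIP, claimed by two nodes.
sample = [
    (0.0, "192.0.2.10", "aa:aa:aa:aa:aa:01"),  # old node
    (5.0, "192.0.2.10", "aa:aa:aa:aa:aa:02"),  # new node sends GARP
    (6.0, "192.0.2.10", "AA:AA:AA:AA:AA:01"),  # old node still answers
    (7.0, "192.0.2.20", "aa:aa:aa:aa:aa:03"),  # unrelated address
]
conflicts = find_conflicting_ips(sample)
print(conflicts)
```

Any EgressIP that appears in the result was announced by more than one MAC address during the capture.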

Actual results:

Switches get overloaded and ports get blocked, depending on the network infrastructure.

Expected results:

The old node should remove the EgressIP as soon as possible. The time during which the IP is announced from two nodes should be kept to a minimum. Relying on the reboot to clear the IP is not sufficient, because the reboot time is not predictable.
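
To make "kept to a minimum" measurable, the overlap can be estimated from the same kind of capture data. This is only a sketch under stated assumptions: the (timestamp, IP, MAC) tuples, the function name `overlap_window`, and the sample values are hypothetical, and because only observed packets count, the result is a lower bound on the real overlap.

```python
def overlap_window(observations, ip, old_mac, new_mac):
    """Seconds from the new node's first claim to the old node's last claim.

    Returns 0.0 if the new node never claimed the IP, or if the old node was
    not seen claiming it at or after the new node's first claim.
    """
    old_mac, new_mac = old_mac.lower(), new_mac.lower()
    new_times = [ts for ts, seen_ip, mac in observations
                 if seen_ip == ip and mac.lower() == new_mac]
    if not new_times:
        return 0.0
    first_new = min(new_times)
    late_old = [ts for ts, seen_ip, mac in observations
                if seen_ip == ip and mac.lower() == old_mac and ts >= first_new]
    return max(late_old) - first_new if late_old else 0.0


# Hypothetical capture: the old node keeps answering until t = 42.5 s.
capture = [
    (0.0, "192.0.2.10", "aa:aa:aa:aa:aa:01"),   # old node, before failover
    (5.0, "192.0.2.10", "aa:aa:aa:aa:aa:02"),   # new node sends GARP
    (6.0, "192.0.2.10", "aa:aa:aa:aa:aa:01"),   # old node still answers
    (42.5, "192.0.2.10", "aa:aa:aa:aa:aa:01"),  # last reply from old node
]
window = overlap_window(capture, "192.0.2.10",
                        "aa:aa:aa:aa:aa:01", "aa:aa:aa:aa:aa:02")
print(f"dual-MAC window: {window:.1f} s")
```

Comparing this value before and after a fix gives a simple way to confirm that the old node releases the EgressIP sooner.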

Additional info:

EgressIP Failover Dynamics: Normal Behavior and the Impact of Graceful Shutdowns/Reboots

Assignee: Arnab Ghosh

Reporter: Roman Hodain

QA Contact: Anurag Saxena

Need Info From: None

Votes: 3

Watchers: 9

Created: 2026/02/19 2:47 PM

Updated: 2026/02/26 1:20 PM
