OpenShift Bugs / OCPBUGS-59530

[4.19][upgrade] OCP4.18.X - Stale SNATs/LRPs due to failed sync to add metadata after upgrade


    • Incidents & Support
    • Critical
    • CORENET Sprint 272, CORENET Sprint 273, CORENET Sprint 274
    • Customer Escalated, Customer Facing
    • Done
    • Known Issue
      * Before this update, a cluster upgrade to {product-title} 4.18 caused inconsistent egress IP allocation due to stale Network Address Translation (NAT) handling. This issue occurred only when you deleted an egress IP pod while the OVN-Kubernetes controller for an egress node was down. As a consequence, duplicate Logical Router Policies and egress IP usage occurred, which caused inconsistent traffic flow and outage. With this release, egress IP allocation cleanup ensures consistent and reliable egress IP allocation in {product-title} 4.18 clusters. (link:https://issues.redhat.com/browse/OCPBUGS-59530[OCPBUGS-59530])

      Description of problem:

      • A recently upgraded cluster with multiple EgressIPs and allocated hosts shows duplicate NATs for a given pod, and duplicate hosts with the same EgressIP allocation. The EgressIP output of `oc get eip` lists only one host, but NAT entries on the nodes themselves overlap.
      • Performed an OVNKube DB rebuild on all egress-assignable nodes, following the documented 4.14+ rebuild procedure.
      • After the rebuild, the behavior persists: EgressIP allocation is still duplicated across multiple hosts and egress usage is impacted/unreliable. Pods are NATed to the node IP as well as the EgressIP inconsistently.
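The overlap described above can be checked mechanically. This is a minimal sketch (not part of the report), assuming per-node dumps of `ovn-nbctl lr-nat-list GR_<node>` in the usual TYPE / EXTERNAL_IP / LOGICAL_IP column layout; the node names and addresses below are fabricated for illustration:

```python
# Sketch only: flag egress IPs that appear as SNAT external IPs on more
# than one node's gateway router, given per-node `lr-nat-list` output.
from collections import defaultdict

def find_duplicate_snats(dumps):
    """Map each SNAT external IP to the nodes advertising it, keeping
    only IPs claimed by more than one node (the stale/duplicate case)."""
    nodes_by_ip = defaultdict(set)
    for node, text in dumps.items():
        for line in text.splitlines():
            parts = line.split()
            if len(parts) >= 2 and parts[0] == "snat":
                nodes_by_ip[parts[1]].add(node)
    return {ip: sorted(n) for ip, n in nodes_by_ip.items() if len(n) > 1}

# Fabricated sample dumps for two nodes.
dumps = {
    "worker-0": "snat  192.0.2.10  10.128.2.14\nsnat  192.0.2.10  10.128.2.15\n",
    "worker-1": "snat  192.0.2.10  10.131.0.7\nsnat  192.0.2.11  10.131.0.9\n",
}
print(find_duplicate_snats(dumps))  # → {'192.0.2.10': ['worker-0', 'worker-1']}
```

Any external IP reported for more than one node corresponds to the duplicate SNAT condition described in this report.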

      Version-Release number of selected component (if applicable):

      • OpenShift 4.18.15

      How reproducible:

      • Internal replication is in progress; currently observed on one production cluster in a customer environment.

      Steps to Reproduce:

      1. Deploy a cluster with multiple EgressIPs and multiple egress-capable host nodes

      2. Upgrade to 4.18.15

      3. Observe stale NAT handling (documented further in https://issues.redhat.com/browse/OCPBUGS-50709).

      4. Perform the workaround outlined in KCS https://access.redhat.com/solutions/7110252.

      5. Observe that after the DB rebuilds on the nodes the behavior is NOT resolved and the issue persists: EgressIP allocation may have migrated to new hosts, but stale entries are still present on multiple endpoints and traffic flow is blocked.
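The duplicate Logical Router Policies involved in this condition can be spotted in `ovn-nbctl lr-policy-list` output. A minimal sketch (not part of the report), assuming lines shaped as `PRIORITY MATCH reroute NEXTHOP`; the priorities, matches, and nexthops below are illustrative only:

```python
# Sketch only: group reroute policies by (priority, match); a pod source
# address carrying more than one nexthop indicates a stale duplicate LRP.
from collections import defaultdict

def find_duplicate_reroutes(policy_text):
    nexthops = defaultdict(set)
    for line in policy_text.splitlines():
        parts = line.split()
        if "reroute" in parts:
            i = parts.index("reroute")
            key = (parts[0], " ".join(parts[1:i]))  # (priority, match)
            nexthops[key].update(parts[i + 1:])
    return {k: sorted(v) for k, v in nexthops.items() if len(v) > 1}

# Fabricated sample: the first pod IP is rerouted to two different nexthops.
sample = """\
102 ip4.src == 10.128.2.14 reroute 100.64.0.3
102 ip4.src == 10.128.2.14 reroute 100.64.0.5
102 ip4.src == 10.131.0.7 reroute 100.64.0.5
"""
print(find_duplicate_reroutes(sample))
# → {('102', 'ip4.src == 10.128.2.14'): ['100.64.0.3', '100.64.0.5']}
```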

      Actual results:

      • EgressIP allocation is incorrect and unreliable

      Expected results:

      • EgressIP should be handled consistently (pending remap of EgressIP to its own handler), but more importantly, the cleanup tasks that rectify this condition should work consistently.

      Additional info:

      • Given that the known workaround of a DB rebuild does NOT appear to be successful and egress IP handling is still impacted, filing this as a new bug rather than linking to the previous bug/epic, because the behavior currently cannot be resolved. Engineering assistance is needed to identify why the DB rebuild fails to resolve the EgressIP allocation behavior.
      • I am working on better tooling to improve detection of this behavior, and on possible methods for manual cleanup (TBD), but will need engineering assistance to prioritize this, which may involve escalating the epic for egress-handler creation if no other workarounds can be identified.
      • The customer is blocked from upgrading their clusters, and one cluster is currently impacted with an ongoing outage.

       

      KCS created: https://access.redhat.com/solutions/7125049

              Martin Kennelly
              Will Russell
              Jean Chen
              Surya Seetharaman