OpenShift Bugs: OCPBUGS-70337

OCP4.18.30 - EgressIP readiness flapping causing continuous failover/interrupts

      Description of problem:

      A platform upgrade from 4.17.45 to 4.18.30 completed successfully. The planned step-through to 4.19.z was paused when it was discovered that egressIP traffic is being interrupted, with significant traffic loss.
      
      Observed that the egressIP is continually being reassigned to new peer host nodes and never stabilizes. During each failover, traffic is interrupted as the assignment moves to a neighbor node.
      
      The original configuration was 7 nodes with egress-assignable capacity, using secondary VRF networks (bond1-3045@bond1) to host the target IPs. 2 IPs were allocated to one EgressIP object: egress-qa.
      
      This object has been pared down to 1 egressIP to simplify testing, and it has been observed that all labeled nodes are still cycled through.
      
      Observed that when we remove the labels from all nodes and label only one host, we have variable connectivity upstream: some nodes can make new connections but fail intermittently, while others fail continually and connect only intermittently. No node is "stable".
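The relabeling described above can be done with the standard OVN-Kubernetes egress-assignable node label; a sketch (the node name is a placeholder for one of the customer's workers):

```shell
# Remove the egress-assignable label from every node, then label a single host.
# "worker-0" is a placeholder node name.
oc label node --all k8s.ovn.org/egress-assignable-
oc label node worker-0 k8s.ovn.org/egress-assignable=""

# Watch the assignment; with one labeled node the ASSIGNED NODE column
# should stay pinned rather than cycling.
oc get egressip -w
```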
      
      This is in stark contrast to 4.17, where this behavior was not present.
      
      Customer performed the workaround in KCS https://access.redhat.com/solutions/7125049: removed all egressIPs, performed an ovnkube DB rebuild on all nodes, and re-added the egressIPs. The behavior reappears immediately.
      
      
      `reachabilityTotalTimeoutSeconds: 1` has been set so that the failure condition surfaces frequently; the egressIP continues to bounce between nodes roughly every 4 seconds.
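For reference, this timeout lives on the cluster Network operator config; a minimal sketch of the relevant fragment, assuming the standard OVN-Kubernetes egressIPConfig field path (the value shown is the one used for testing here):

```yaml
# Sketch of the operator config used to shorten the reachability probe timeout
# (object name "cluster"); only the relevant fields are shown.
apiVersion: operator.openshift.io/v1
kind: Network
metadata:
  name: cluster
spec:
  defaultNetwork:
    type: OVNKubernetes
    ovnKubernetesConfig:
      egressIPConfig:
        # A 1s timeout makes reachability failures (and the resulting
        # reassignment) surface quickly for debugging.
        reachabilityTotalTimeoutSeconds: 1
```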

       

      Version-Release number of selected component (if applicable):

      4.18.30

       

      How reproducible:

      • Every time within customer environment (high impact)
      • Working on internal replicator now

      Steps to Reproduce:

      1. Deploy 4.18.30 cluster (baremetal observed, unclear if platform specific)

      2. Create secondary interface using nncp to handle egressIP traffic

      3. Create egressIP object 

      4. label namespace with matching label

      5. Create an httpd test pod to curl from; observe intermittent connectivity failures reaching outbound, and observe the egressIP allocation swapping between nodes.
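A minimal sketch of the EgressIP object and namespace selector from steps 3-4 (the CRD fields are the standard OVN-Kubernetes ones; the IP, label key, and label value are hypothetical placeholders, not the customer's values):

```yaml
# Hypothetical EgressIP object matching the reproduction steps.
apiVersion: k8s.ovn.org/v1
kind: EgressIP
metadata:
  name: egress-qa
spec:
  egressIPs:
    - 192.0.2.10          # placeholder target IP on the secondary network
  namespaceSelector:
    matchLabels:
      env: egress-qa      # placeholder label applied to the test namespace
```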

      GRPC curls from node to node at port 9 give the expected "connection refused", but this hasn't been exhaustively tested.

      Actual results:

      • EgressIP in a flapping state; pod traffic is interrupted during the continual migrations

      Expected results:

      • EgressIP should be stable/viable on 4.18.30

      Additional info:

      The bond1 VRF interface appears to be flapping on these nodes. It is configured via the nmstate operator and appears stable according to the NNCP object and the nmstate handler pod; however, we see the interface repeatedly being re-upped and its MAC address cycling. I expect this can and does lead to unstable network conditions.
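To confirm the re-up/MAC cycling from the node side, something like the following could be used (standard iproute2, journalctl, and nmstatectl tooling; the interface names are the ones from this report):

```shell
# Stream link-state changes for the VLAN interface in real time;
# repeated DOWN/UP transitions or MAC changes here would match the
# observed flapping.
ip monitor link dev bond1-3045

# NetworkManager's view of the same events.
journalctl -u NetworkManager -f | grep -E 'bond1-3045|bond1'

# Compare the nmstate-applied state against what is actually on the node.
nmstatectl show bond1-3045
```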

       

      It is unclear why this is occurring; see the first comment below for specific data segments and log snips.

              bbennett@redhat.com Ben Bennett
              rhn-support-wrussell Will Russell
              Anurag Saxena