Type: Bug
Resolution: Done-Errata
Priority: Critical
Fix Version/s: 4.17.z
Affects Version/s: 4.14.z, 4.15.z, 4.17.z, 4.16.z
Component/s: Networking / ovn-kubernetes
Labels:

Activity Type:
Quality / Stability / Reliability
Blocked:
False
Blocked Reason:

None
Story Points:
3
Severity:
Important
Regression:
None

Target Backport Versions:

4.15.z, 4.16.z
Target Version:

4.17.z
Release Blocker:
None
Sprint:
OSDOCS Sprint 262, SDN Sprint 264
sprint_count:
2

Customer Impact:

Customer Escalated

SFDC Cases Counter:
SFDC Cases Open:
SFDC Cases Links:

PX Review Complete:
PX Priority Data:
PX Impact Score:

Release Note Status:
Done
Release Note Type:
Known Issue
Release Note Text:

* A regression in the behaviour of `libreswan` caused some nodes with IPsec enabled to lose communication with pods on other nodes in the same cluster. To resolve this issue, consider disabling IPsec for your cluster. (link:https://issues.redhat.com/browse/OCPBUGS-43714[*OCPBUGS-43714*])


Escape Reason:
None
Escape Impact:
None
Corrective Measures:
None
SDLC stage when should've been found:
None

Description of problem:

Bare Metal UPI cluster

Nodes lose communication with other nodes, which also breaks communication for the pods running on those nodes. Rebuilding the OVN database on the affected nodes fixes the issue temporarily, but the nodes eventually degrade and lose communication again. Note that an OVN rebuild helps even though Host Networking is set to True, so traffic is using the kernel routing table.

Update: also observed on vSphere with the `routingViaHost: false`, `ipForwarding: Global` configuration.
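
For triage, the gateway and IPsec settings discussed here can be read from the cluster Network operator configuration. The following is a minimal sketch, assuming the JSON comes from something like `oc get network.operator.openshift.io cluster -o json`. The function name `summarize_network_config` and the sample document are hypothetical, and treating an `ipsecConfig` stanza without a `mode` field as "enabled" is an assumption about how older releases represent IPsec.

```python
import json


def summarize_network_config(raw_json: str) -> dict:
    """Extract the gateway and IPsec settings relevant to this bug."""
    spec = json.loads(raw_json).get("spec") or {}
    ovn = (spec.get("defaultNetwork") or {}).get("ovnKubernetesConfig") or {}
    gateway = ovn.get("gatewayConfig") or {}
    ipsec = ovn.get("ipsecConfig")

    if ipsec is None:
        ipsec_state = "not configured"
    else:
        # Assumption: a stanza without "mode" means IPsec is switched on.
        ipsec_state = ipsec.get("mode", "enabled (no mode field)")

    return {
        "routingViaHost": gateway.get("routingViaHost", False),
        "ipForwarding": gateway.get("ipForwarding", "unset"),
        "ipsec": ipsec_state,
    }


# Hypothetical sample resembling the vSphere configuration described above.
sample = json.dumps({
    "spec": {"defaultNetwork": {"ovnKubernetesConfig": {
        "gatewayConfig": {"routingViaHost": False, "ipForwarding": "Global"},
        "ipsecConfig": {"mode": "Full"},
    }}}
})
print(summarize_network_config(sample))
```

Recording these three values for each affected cluster makes it easier to see that the failure does not depend on the gateway mode, only on IPsec being enabled.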

Version-Release number of selected component (if applicable):

 4.14.7, 4.14.30

How reproducible:

Cannot be reproduced locally, but it occurs repeatedly in the customer environment.

Steps to Reproduce:

1. Identify a host node whose pods cannot be reached from other hosts in default namespaces (tested via openshift-dns).
2. Observe that curl requests to the affected peer pod consistently time out.
3. Run tcpdump against the target pod and observe that packets arrive and are acknowledged, but the replies never route back to the client pod. The SYN/ACK is seen at the pod network layer but not at Geneve, so it is dropped before reaching the Geneve tunnel.
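
The reachability check above can be scripted so that it is repeatable across nodes. This is a minimal sketch rather than the procedure used in the case: `probe_tcp` is a hypothetical helper, and the address and port in the usage note are placeholders. A result of `"timeout"` matches the symptom described here, whereas `"refused"` would point at a different problem (the peer answered with a reset).

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        # No SYN/ACK came back within the timeout: the symptom in this bug.
        return "timeout"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        return f"error: {exc}"
```

For example, calling `probe_tcp("10.128.2.5", 7777)` from a pod on another node (with the placeholder address replaced by the target pod IP) returns `"timeout"` when return traffic is being dropped.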

Actual results:

Nodes repeatedly degrade and lose communication even after the issue is fixed with an OVN database rebuild (the rebuild provides only hours or days of respite, not a permanent resolution).

Expected results:

Nodes should not lose communication, and even if they do, it should not happen repeatedly.

Additional info:

What's been tried so far
========================

- Multiple OVN rebuilds on different nodes (works, but the node eventually hits the issue again)

- Flushing conntrack (doesn't work)

- Restarting nodes (doesn't work)

Data gathered
=============

- Tcpdump from all interfaces for dns-pods going to port 7777 (to segregate traffic)

- ovnkube-trace

- sosreports from two nodes with communication issues, taken before an OVN rebuild

- sosreports from two nodes with communication issues, taken after an OVN rebuild

- OVS trace dumps of br-int and br-ex 
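
When comparing captures from different interfaces, it can help to diff them mechanically instead of by eye. The sketch below assumes plain tcpdump text output (one packet per line, in the usual `IP src.port > dst.port: Flags [...]` form) and lists SYN/ACKs that appear in the pod-side capture but not in the tunnel-side capture. The function names are hypothetical, and the approach only works where the tunnel capture shows decoded inner packets; if the tunnel traffic is ESP-encrypted, the inner headers are not visible.

```python
import re

# Assumption: tcpdump text lines such as
# "12:00:00.1 IP 10.128.2.5.7777 > 10.129.0.7.41234: Flags [S.], seq 1, ack 2, length 0"
FLOW_RE = re.compile(
    r"IP (\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+): Flags \[([^\]]+)\]"
)


def synack_flows(lines):
    """Return the set of (src, sport, dst, dport) tuples that carried a SYN/ACK."""
    flows = set()
    for line in lines:
        match = FLOW_RE.search(line)
        if match and match.group(5) == "S.":
            src, sport, dst, dport, _flags = match.groups()
            flows.add((src, int(sport), dst, int(dport)))
    return flows


def dropped_before_tunnel(pod_capture, tunnel_capture):
    """SYN/ACKs seen on the pod interface that never appear in the tunnel capture."""
    return synack_flows(pod_capture) - synack_flows(tunnel_capture)
```

Each argument is an iterable of capture lines, for example an open text file produced by `tcpdump -nn -r <file>`. A non-empty result is consistent with replies being dropped between the pod interface and the tunnel.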


====

More data in nested comments below.

Linked KCS: https://access.redhat.com/solutions/7091399

clones

OCPBUGS-43713 [4.18 IPSEC] pod to pod communication is degraded

Closed

depends on

OCPBUGS-43713 [4.18 IPSEC] pod to pod communication is degraded

Closed

is cloned by

OCPBUGS-43715 [4.16 IPSEC] pod to pod communication is degraded

Closed

OCPBUGS-44659 Disabling IPsec encryption doc contains inaccurate note

Closed

is depended on by

OCPBUGS-43715 [4.16 IPSEC] pod to pod communication is degraded

Closed

is documented by

OCPBUGS-44672 Add pod to pod communication is degraded note to RN docs

Closed

links to

openshift/cluster-network-operator#2597: [release-4.17] OCPBUGS-43714: Skip including default crypto policies to avoid authby issue

openshift/openshift-docs#85047: OCPBUGS-43714: Added IPsec enabled node known issue to RNs

RHBA-2024:11522 OpenShift Container Platform 4.17.z bug fix update


Assignee: Periyasamy Palanisamy

Reporter: Courtney Ruhm

QA Contact: Huiran Wang

Need Info From: None

Votes: 0

Watchers: 7

Created: 2024/10/23 5:10 AM

Updated: 2025/07/19 1:32 PM

Resolved: 2025/07/07 2:44 PM
