OpenShift Bugs / OCPBUGS-61034

OCP 4.16.46 Workers/Masters cannot cross-talk via pod network after upgrade


    • Type: Bug
    • Resolution: Not a Bug
    • Affects Version: 4.16.z
    • Quality / Stability / Reliability
    • Severity: Critical

      Description of problem:

      Primary problem: OpenShift router-default pods (hostNetwork, running on worker nodes) cannot reach openshift-authentication (oauth) pods running on master nodes (connection timeout).

      ovnkube-trace indicates successful flows in both directions.

      The traffic-blocking issue is reproducible between all workers and all masters (vSphere infrastructure), even on the same vSphere host.

      No NetworkPolicies are present, and an ovnkube DB rebuild on all hosts makes no difference.

      Worker-to-worker and master-to-master traffic is unobstructed; worker-to-master (and vice versa) is blocked. See below for the confirmed flow matrix and a sketch of representative test commands:

      # cross-pool flows (NOT OK):
      hostNetwork (worker) --> endpointIP (master) (fails)
      podNetwork (worker) --> endpointIP (master) (fails)
      hostNetwork (worker) --> serviceIP --> endpoint (master) (succeeds) !!INTERESTING!!
      podNetwork (worker) --> serviceIP --> endpoint (master) (fails)
      ---
      
      # same-pool flows (OK):
      hostNetwork (worker) --> endpointIP (worker) (succeeds)
      podNetwork (worker) --> endpointIP (worker) (succeeds)
      hostNetwork (master) --> endpointIP (master) (succeeds)
      podNetwork (master) --> endpointIP (master) (succeeds)
      
      # host network to host network (OK):
      worker PING or CURL to masterIP:6443 (succeeds)
      master PING or CURL to workerIP:443 (succeeds)
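
      For reference, a minimal sketch of the kind of commands used to exercise these paths (pod names, node names, IPs, ports, and URL paths below are placeholders, and curl is assumed to be available in the client images):

      # hostNetwork client on a worker --> oauth endpoint IP on a master (fails)
      oc -n openshift-ingress exec <router-default-pod> -- curl -kv --connect-timeout 5 https://<oauth-endpoint-ip>:<endpoint-port>/healthz
      
      # same client --> the oauth-openshift service IP (returns 200)
      oc -n openshift-ingress exec <router-default-pod> -- curl -kv --connect-timeout 5 https://<oauth-service-ip>:<service-port>/healthz
      
      # podNetwork client on a worker --> oauth endpoint IP on a master (fails)
      oc -n <test-namespace> exec <pod-network-test-pod> -- curl -kv --connect-timeout 5 https://<oauth-endpoint-ip>:<endpoint-port>/healthz
      
      # host network to host network across the pools (succeeds)
      oc debug node/<worker-node> -- curl -kv --connect-timeout 5 https://<master-node-ip>:6443/readyz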

       

      Version-Release number of selected component (if applicable):

      4.16.46

      How reproducible:

      • Continuously - ongoing in a single customer environment; we have been unable to replicate it internally.

      Steps to Reproduce:

      //Timeline of events:
      
      4.12 cluster created (SDN)
      
      4.14 upgraded successfully
      
      SDN --> OVN (completed with some issues)
      
      All worker nodes were replaced with freshly reinstalled nodes using expanded subnets, because the original CIDR was too small. Rather than reinstalling the cluster, each worker was removed and a fresh replacement host was installed and added to the cluster one at a time with the expanded subnet. (Masters were NOT replaced.)
      
      (20d uptime healthy)
      
      5 days ago:
      4.14 --> 4.15 (masters only)
      
      4.15 --> 4.16 (masters and workers) [problem state started]
      
      
      

      Actual results:

      • Customer platform is degraded

      Expected results:

      Additional info:

      • Cloudpaks cluster, IBM support engaged
      • TCPDUMP has been pulled, and we observe that in the failure state the SYN packet from the router pod (or a client on the worker) is received by the target oauth container and a SYN/ACK is generated. However, the SYN/ACK is NOT seen by the client side on any interface (a capture sketch follows this list).
      • No NetworkPolicies in place
      • No nftable or IPtable rule modifications on any host
      • No Firewall between the nodes or traffic shaping has been observed
      • 6081/UDP (Geneve) traffic is unobstructed as far as we can tell.
      • ovnkube-trace output is available (will upload separately) and indicates successful flows.
      • ovnkube DB rebuilds and node reboots do not make a difference.
      • NTP time is synced on the hosts.
      • All VMs have been moved to the same ESXi host to confirm local routing is okay/unblocked (no change).
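
      A minimal capture sketch for confirming the one-way SYN/ACK observation from both ends (node names, interface names, ports, and addresses are placeholders; tcpdump is run on the host via oc debug):

      # on the master hosting the oauth pod: the SYN arrives and a SYN/ACK is generated
      oc debug node/<master-node> -- chroot /host tcpdump -nn -i any "tcp port <endpoint-port> and host <client-ip>"
      
      # on the client worker: the SYN/ACK never shows up on any interface
      oc debug node/<worker-node> -- chroot /host tcpdump -nn -i any "tcp port <endpoint-port> and host <oauth-endpoint-ip>"
      
      # Geneve encapsulation (6081/UDP) between the two nodes, captured from both ends
      oc debug node/<worker-node> -- chroot /host tcpdump -nn -i <uplink-interface> "udp port 6081 and host <master-node-ip>"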

       

      The most notable behavior is that when we curl from the router pod (hostNetwork) to the SERVICE IP (internal) of the oauth pods in openshift-authentication, we get a 200. If we call the endpoint directly at the pod's exposed port, the call fails.

      • Question for engineering: what is different about this flow? We should be NATing the request through the 100.88.0.0/16 or 100.64.0.0/16 subnet in either case; perhaps source NAT is the difference - via the service IP the pod would see a NATed address as the client for the return path, while a direct call from hostNetwork would present the node IP as the source? (See the OVN NAT/load-balancer checks sketched below.)
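
      One way to compare the two paths from the OVN side is to dump the NAT and load-balancer programming for the client node's gateway router. A minimal sketch, assuming 4.16's interconnect layout where the NB database runs in the ovnkube-node pod on each node (pod, node, and service values are placeholders):

      # ovnkube-node pod running on the client worker
      oc -n openshift-ovn-kubernetes get pods -o wide | grep ovnkube-node
      
      # SNAT rules on that node's gateway router (named GR_<node-name>)
      oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c nbdb -- ovn-nbctl lr-nat-list GR_<worker-node>
      
      # load-balancer entries carrying the oauth service VIP
      oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c nbdb -- ovn-nbctl lb-list | grep <oauth-service-ip>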

      //analysis + suspicion:

      • I expect there is a return-path problem here, where the reply packet is sent out into the network because of a configuration issue at the gateway/switch, but I can't rule out a routing/handling issue in the ovnkube routing tables, which is why this needs a second opinion (see the return-path checks sketched below).
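
      A minimal sketch of return-path checks on the client worker, assuming conntrack is available on the host (node names, ports, and addresses are placeholders):

      # connection tracking: a flow whose SYN/ACK never returns should sit in SYN_SENT
      oc debug node/<worker-node> -- chroot /host conntrack -L -p tcp --dport <endpoint-port>
      
      # host routes for the join/transit subnets and OVN interfaces mentioned above
      oc debug node/<worker-node> -- chroot /host ip route show | grep -E "100.64|100.88|br-ex|ovn-k8s-mp0"
      
      # reverse-path filtering: a strict setting can silently drop an asymmetric return
      oc debug node/<worker-node> -- chroot /host sysctl net.ipv4.conf.all.rp_filter net.ipv4.conf.br-ex.rp_filter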

      //Data gathered + compiled and available in first private comment in the bug below. 

              Ben Bennett (bbennett@redhat.com)
              Will Russell (rhn-support-wrussell)
              Anurag Saxena