OpenShift Bugs / OCPBUGS-55453

Pod-to-pod connectivity lost in 500/250-node IPsec cluster (4.14 works, 4.19+ broken)

      Description of problem:

      On a large AWS cluster with IPsec enabled, pod-to-pod connectivity across nodes is partially lost: some pods cannot reach pods on certain other nodes. The issue reproduces on 4.19 nightly builds at 250 and 500 worker nodes; 4.14 does not show it.

      Version-Release number of selected component (if applicable): 

      4.19.0-0.nightly-2025-04-24-005837

      How reproducible: Reproduced twice on 4.19.0-0.nightly-2025-04-24-005837, once with 500 worker nodes and once with 250 worker nodes; a 120-node cluster does not hit the issue.

      Steps to Reproduce:

      1. Install and scale an AWS cluster with 3 masters (m5.8xlarge), 497 workers (m5.xlarge), and 3 infra nodes (c5.12xlarge). Move ingress, monitoring, and registry to the infra nodes.

      2. Check pod-to-pod connectivity across nodes (a sketch of the check is shown below).
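
      A minimal sketch of the cross-node check, assuming the hello-daemonset pods serve "Hello OpenShift!" on port 8080 (as the outputs below show); the loop and the pod-ips.txt file are illustrative, not the actual test harness:

      # List the pod IP (column 6 of `oc get pods -o wide`) of every
      # hello-daemonset pod, then probe each one from a single source pod.
      oc get pods -o wide --no-headers | grep hello-daemonset | awk '{print $6}' > pod-ips.txt
      while read -r ip; do
        oc exec hello-daemonset-22g7l -- curl -s --connect-timeout 2 "${ip}:8080" >/dev/null \
          || echo "unreachable from hello-daemonset-22g7l: ${ip}"
      done < pod-ips.txt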

      Actual results:

      Some pods on some nodes cannot reach pods on certain other nodes.

      For example:

      The pod hello-daemonset-22g7l (pod IP 10.129.184.10, on node ip-10-0-77-217.us-east-2.compute.internal) cannot connect to the pod at 10.130.6.9 on node ip-10-0-9-96.us-east-2.compute.internal:

      % oc exec -it hello-daemonset-22g7l -- bash
      bash-5.1$ curl --retry 3 --connect-timeout 2 10.130.6.9:8080
      curl: (28) Connection timeout after 2000 ms
      Warning: Problem : timeout. Will retry in 1 seconds. 3 retries left.
      curl: (28) Connection timeout after 2001 ms
      Warning: Problem : timeout. Will retry in 2 seconds. 2 retries left.
      curl: (28) Connection timeout after 2000 ms
      Warning: Problem : timeout. Will retry in 4 seconds. 1 retries left.
      curl: (28) Connection timeout after 2001 ms

      The same pod hello-daemonset-22g7l can connect to other destinations, for example its own pod IP 10.129.184.10:

      % oc exec -it hello-daemonset-22g7l -- bash 
      bash-5.1$ curl --retry 3 --connect-timeout 2 10.129.184.10:8080
      Hello OpenShift!

      Another pod, hello-daemonset-zztcb, on a different node can connect to 10.130.6.9:

      % oc exec -it hello-daemonset-zztcb -- bash
      bash-5.1$ curl --retry 3 --connect-timeout 2 10.130.6.9:8080
      Hello OpenShift!

      Pod-to-node IP mapping:

      % oc get po hello-daemonset-22g7l -o wide
      NAME                    READY   STATUS    RESTARTS   AGE    IP              NODE                                        NOMINATED NODE   READINESS GATES
      hello-daemonset-22g7l   1/1     Running   0          166m   10.129.184.10   ip-10-0-77-217.us-east-2.compute.internal   <none>           <none>
      
      % oc get po  -o wide | grep 10.130.6.9   
      hello-daemonset-zvj4x   1/1     Running   0          167m   10.130.6.9      ip-10-0-9-96.us-east-2.compute.internal     <none>           <none> 

      Checked the IPsec logs on node ip-10-0-9-96.us-east-2.compute.internal; there are "half-loaded" connection entries there. The full log has been uploaded.

      2025-04-29T04:02:57Z | 8056| ovs-monitor-ipsec | INFO | Bringing up ipsec connection ovn-28bb2b-0-out-1
      2025-04-29T04:02:57Z | 8058| ovs-monitor-ipsec | INFO | ovn-7f9b0a-0-out-1 is defunct, removing
      2025-04-29T04:02:57Z | 8060| ovs-monitor-ipsec | INFO | ovn-7f9b0a-0-in-1 is half-loaded, removing
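
      For reference, a sketch of how the monitor log can be pulled from the affected node (the log path /var/log/openvswitch/ovs-monitor-ipsec.log and the IPsec pod naming are assumptions, not confirmed by this report):

      # Find the IPsec pod running on the affected node (assumes the IPsec
      # DaemonSet pods live in openshift-ovn-kubernetes with "ipsec" in their name).
      oc get pods -n openshift-ovn-kubernetes -o wide | grep ipsec | grep ip-10-0-9-96

      # Grep the monitor log directly on the host through a debug pod
      # (assumed log location under /var/log/openvswitch/).
      oc debug node/ip-10-0-9-96.us-east-2.compute.internal -- \
        chroot /host grep -E 'half-loaded|defunct' /var/log/openvswitch/ovs-monitor-ipsec.log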

      Expected results:

      All pods on all nodes can connect to pods on other nodes.

      Additional info:

      More details of the test execution are in this doc: https://docs.google.com/document/d/1O14ZNA3Qs-3ObtMpZVJkVW_JfpDC6liptpZcIxl3u7M/

      Slack discussion: https://redhat-internal.slack.com/archives/C08DNAFC85T/p1745906230814439 

      All node-to-node check results are uploaded to: https://drive.google.com/file/d/1SwN4u4Q4896OOj6aWl9Oimi-od_LgiAK/view?usp=drive_link
      If a log file is present for a target IP, that pod IP could not be reached from the source pod. For example, the listing below means pod hello-daemonset-22g7l could not connect to the pod at 10.130.6.9:

      hello-daemonset-22g7l
      -rw-r--r-- 1 1000740000 root    0 Apr 29 05:53 10.130.6.9.log
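
      A quick way to summarize the extracted archive, assuming it unpacks into one directory per source pod containing one <target-ip>.log per unreachable target, as the listing above suggests:

      # Count unreachable targets per source pod in the extracted results.
      for d in */; do
        echo "${d%/}: $(find "$d" -name '*.log' | wc -l) unreachable targets"
      done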

      The pod-to-node IP mapping is attached as pod-node-ip-mapping.txt.

      This is an internal RedHat testing failure.
