Bug | Resolution: Unresolved | Normal | 4.19 | Quality / Stability / Reliability | Critical
Description of problem:
In large clusters with IPsec enabled, some pods lose connectivity to pods on other nodes; the ovs-monitor-ipsec logs on the affected nodes show IPsec connections being removed as half-loaded.
Version-Release number of selected component (if applicable):
4.19.0-0.nightly-2025-04-24-005837
How reproducible: Reproduced twice in 4.19, once with 500 nodes (4.19.0-0.nightly-2025-04-24-005837) and once with 250 worker nodes (4.19.0-0.nightly-2025-04-24-005837); a 120-node cluster showed no issue.
Steps to Reproduce:
1. Install and scale an AWS cluster with 3 masters (m5.8xlarge), 497 workers (m5.xlarge), and 3 infra nodes (c5.12xlarge). Move ingress, monitoring, and registry to the infra nodes.
2. Check pod connectivity across nodes (see the sketch below for one way to script this check)
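A minimal sketch of how step 2 can be scripted. The daemonset name and port 8080 are taken from the output below; the name=hello-daemonset label selector and the loop itself are assumptions, not the exact script used in this run:

# run from a client with oc access; list targets unreachable from one source pod
SRC=hello-daemonset-22g7l                                   # source pod (assumed; see output below)
for target in $(oc get pods -l name=hello-daemonset -o jsonpath='{.items[*].status.podIP}'); do
  oc exec "$SRC" -- curl -s --connect-timeout 2 "${target}:8080" >/dev/null \
    || echo "$SRC -> ${target} unreachable"
done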
Actual results:
Some pods on some nodes cannot connect to pods on other nodes.
For example:
The pod hello-daemonset-22g7l (pod IP 10.129.184.10, on node ip-10-0-77-217.us-east-2.compute.internal) cannot connect to pod 10.130.6.9:
% oc exec -it hello-daemonset-22g7l -- bash
bash-5.1$ curl --retry 3 --connect-timeout 2 10.130.6.9:8080
curl: (28) Connection timeout after 2000 ms
Warning: Problem : timeout. Will retry in 1 seconds. 3 retries left.
curl: (28) Connection timeout after 2001 ms
Warning: Problem : timeout. Will retry in 2 seconds. 2 retries left.
curl: (28) Connection timeout after 2000 ms
Warning: Problem : timeout. Will retry in 4 seconds. 1 retries left.
curl: (28) Connection timeout after 2001 ms
The same pod hello-daemonset-22g7l can connect to its own pod IP 10.129.184.10:
% oc exec -it hello-daemonset-22g7l -- bash
bash-5.1$ curl --retry 3 --connect-timeout 2 10.129.184.10:8080
Hello OpenShift!
Another pod, hello-daemonset-zztcb, on a different node can connect to pod 10.130.6.9:
% oc exec -it hello-daemonset-zztcb -- bash
bash-5.1$ curl --retry 3 --connect-timeout 2 10.130.6.9:8080
Hello OpenShift!
Pod/node/IP mapping:
% oc get po hello-daemonset-22g7l -o wide
NAME                    READY   STATUS    RESTARTS   AGE    IP              NODE                                        NOMINATED NODE   READINESS GATES
hello-daemonset-22g7l   1/1     Running   0          166m   10.129.184.10   ip-10-0-77-217.us-east-2.compute.internal   <none>           <none>
% oc get po -o wide | grep 10.130.6.9
hello-daemonset-zvj4x   1/1     Running   0          167m   10.130.6.9      ip-10-0-9-96.us-east-2.compute.internal     <none>           <none>
Checked the IPsec logs on node ip-10-0-9-96.us-east-2.compute.internal; they contain half-loaded connection entries. The full log is uploaded here:
2025-04-29T04:02:57Z | 8056| ovs-monitor-ipsec | INFO | Bringing up ipsec connection ovn-28bb2b-0-out-1
2025-04-29T04:02:57Z | 8058| ovs-monitor-ipsec | INFO | ovn-7f9b0a-0-out-1 is defunct, removing
2025-04-29T04:02:57Z | 8060| ovs-monitor-ipsec | INFO | ovn-7f9b0a-0-in-1 is half-loaded, removing
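One way to cross-check the half-loaded state reported above is to query libreswan on the affected host. This is a sketch assuming host access via oc debug; the connection name prefix ovn-7f9b0a is taken from the log lines above:

% oc debug node/ip-10-0-9-96.us-east-2.compute.internal
sh-5.1# chroot /host
sh-5.1# ipsec status | grep ovn-7f9b0a     # list the loaded connections for this tunnel
sh-5.1# ipsec trafficstatus                # show established Child SAs and traffic counters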
Expected results:
Pods on all nodes can connect to pods on all other nodes.
Additional info:
More details of the test execution are in this doc: https://docs.google.com/document/d/1O14ZNA3Qs-3ObtMpZVJkVW_JfpDC6liptpZcIxl3u7M/
Slack discussion: https://redhat-internal.slack.com/archives/C08DNAFC85T/p1745906230814439
The full node-to-node check results are uploaded to: https://drive.google.com/file/d/1SwN4u4Q4896OOj6aWl9Oimi-od_LgiAK/view?usp=drive_link
If you see a log file, that means the target could not be reached from that pod. For example, the output below means pod hello-daemonset-22g7l cannot connect to pod 10.130.6.9:
hello-daemonset-22g7l
-rw-r--r-- 1 1000740000 root 0 Apr 29 05:53 10.130.6.9.log
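Assuming the uploaded archive keeps that layout (one directory per source pod, one <target-IP>.log file per unreachable target), the failing pairs can be summarized like this:

% for pod in hello-daemonset-*; do echo "== $pod"; ls "$pod" | sed 's/\.log$//'; done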
Is it an: internal RedHat testing failure

- blocks: OCPBUGS-59303 revert libreswan pinning commit in 4.18 ovnk (MODIFIED)
- is blocked by: RHEL-89969 Duplicate Child SAs causing IPsec broken for OCP cluster (In Progress)
- CORENET-6196 Impact: pod to pod connectivity lost in 500/250 nodes IPSEC cluster (Closed)