OpenShift Bugs / OCPBUGS-55453

Pod-to-pod connectivity lost in 500/250-node IPsec cluster (4.14 works, 4.19+ broken)

      Description of problem:

      On a large AWS cluster with IPsec enabled, pod-to-pod connectivity across nodes is partially lost: some pods cannot reach pods on certain other nodes. The issue reproduces on 4.19 nightly builds at 250 and 500 worker nodes; 4.14 does not show it.

      Version-Release number of selected component (if applicable): 

      4.19.0-0.nightly-2025-04-24-005837

      How reproducible: Reproduced twice on 4.19.0-0.nightly-2025-04-24-005837, once with 500 worker nodes and once with 250 worker nodes; a 120-node cluster does not hit the issue.

      Steps to Reproduce:

      1. Install and scale an AWS cluster with 3 masters (m5.8xlarge), 497 workers (m5.xlarge), and 3 infra nodes (c5.12xlarge). Move ingress, monitoring, and registry to the infra nodes.

      2. Check pod-to-pod connectivity across nodes (a sketch of the check is shown below).
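
      A minimal sketch of the cross-node check, assuming the hello-daemonset pods serve "Hello OpenShift!" on port 8080 (as the outputs below show); the loop and the pod-ips.txt file are illustrative, not the actual test harness:

      # List the pod IP (column 6 of `oc get pods -o wide`) of every
      # hello-daemonset pod, then probe each one from a single source pod.
      oc get pods -o wide --no-headers | grep hello-daemonset | awk '{print $6}' > pod-ips.txt
      while read -r ip; do
        oc exec hello-daemonset-22g7l -- curl -s --connect-timeout 2 "${ip}:8080" >/dev/null \
          || echo "unreachable from hello-daemonset-22g7l: ${ip}"
      done < pod-ips.txt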

      Actual results:

      Some pods on some nodes cannot reach pods on certain other nodes.

      For example:

      The pod hello-daemonset-22g7l (pod IP 10.129.184.10, on node ip-10-0-77-217.us-east-2.compute.internal) cannot connect to the pod at 10.130.6.9 on node ip-10-0-9-96.us-east-2.compute.internal:

      % oc exec -it hello-daemonset-22g7l -- bash
      bash-5.1$ curl --retry 3 --connect-timeout 2 10.130.6.9:8080
      curl: (28) Connection timeout after 2000 ms
      Warning: Problem : timeout. Will retry in 1 seconds. 3 retries left.
      curl: (28) Connection timeout after 2001 ms
      Warning: Problem : timeout. Will retry in 2 seconds. 2 retries left.
      curl: (28) Connection timeout after 2000 ms
      Warning: Problem : timeout. Will retry in 4 seconds. 1 retries left.
      curl: (28) Connection timeout after 2001 ms

      The same pod hello-daemonset-22g7l can connect to other destinations, for example its own pod IP 10.129.184.10:

      % oc exec -it hello-daemonset-22g7l -- bash 
      bash-5.1$ curl --retry 3 --connect-timeout 2 10.129.184.10:8080
      Hello OpenShift!

      Another pod, hello-daemonset-zztcb, on a different node can connect to 10.130.6.9:

      % oc exec -it hello-daemonset-zztcb -- bash
      bash-5.1$ curl --retry 3 --connect-timeout 2 10.130.6.9:8080
      Hello OpenShift!

      Pod-to-node IP mapping:

      % oc get po hello-daemonset-22g7l -o wide
      NAME                    READY   STATUS    RESTARTS   AGE    IP              NODE                                        NOMINATED NODE   READINESS GATES
      hello-daemonset-22g7l   1/1     Running   0          166m   10.129.184.10   ip-10-0-77-217.us-east-2.compute.internal   <none>           <none>
      
      % oc get po  -o wide | grep 10.130.6.9   
      hello-daemonset-zvj4x   1/1     Running   0          167m   10.130.6.9      ip-10-0-9-96.us-east-2.compute.internal     <none>           <none> 

      Checked the IPsec logs on node ip-10-0-9-96.us-east-2.compute.internal; there are "half-loaded" connection entries there. The full log has been uploaded.

      2025-04-29T04:02:57Z | 8056| ovs-monitor-ipsec | INFO | Bringing up ipsec connection ovn-28bb2b-0-out-1
      2025-04-29T04:02:57Z | 8058| ovs-monitor-ipsec | INFO | ovn-7f9b0a-0-out-1 is defunct, removing
      2025-04-29T04:02:57Z | 8060| ovs-monitor-ipsec | INFO | ovn-7f9b0a-0-in-1 is half-loaded, removing
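
      For reference, a sketch of how the monitor log can be pulled from the affected node (the log path /var/log/openvswitch/ovs-monitor-ipsec.log and the IPsec pod naming are assumptions, not confirmed by this report):

      # Find the IPsec pod running on the affected node (assumes the IPsec
      # DaemonSet pods live in openshift-ovn-kubernetes with "ipsec" in their name).
      oc get pods -n openshift-ovn-kubernetes -o wide | grep ipsec | grep ip-10-0-9-96

      # Grep the monitor log directly on the host through a debug pod
      # (assumed log location under /var/log/openvswitch/).
      oc debug node/ip-10-0-9-96.us-east-2.compute.internal -- \
        chroot /host grep -E 'half-loaded|defunct' /var/log/openvswitch/ovs-monitor-ipsec.log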

      Expected results:

      All pods on all nodes can connect to pods on other nodes.

      Additional info:

      More details of the test execution are in this doc: https://docs.google.com/document/d/1O14ZNA3Qs-3ObtMpZVJkVW_JfpDC6liptpZcIxl3u7M/

      Slack discussion: https://redhat-internal.slack.com/archives/C08DNAFC85T/p1745906230814439 

      All node-to-node check results are uploaded to: https://drive.google.com/file/d/1SwN4u4Q4896OOj6aWl9Oimi-od_LgiAK/view?usp=drive_link
      If a log file is present for a target IP, that pod IP could not be reached from the source pod. For example, the listing below means pod hello-daemonset-22g7l could not connect to the pod at 10.130.6.9:

      hello-daemonset-22g7l
      -rw-r--r-- 1 1000740000 root    0 Apr 29 05:53 10.130.6.9.log
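
      A quick way to summarize the extracted archive, assuming it unpacks into one directory per source pod containing one <target-ip>.log per unreachable target, as the listing above suggests:

      # Count unreachable targets per source pod in the extracted results.
      for d in */; do
        echo "${d%/}: $(find "$d" -name '*.log' | wc -l) unreachable targets"
      done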

      The pod-to-node IP mapping is attached as pod-node-ip-mapping.txt.

      This is an internal RedHat testing failure.
