  1. OpenShift Bugs
  2. OCPBUGS-51349

OCP 4.14.44 - Transmission failure between host nodes using IPSEC (specific node)


    • Bug
    • Resolution: Won't Do
    • 4.14.z
    • Quality / Stability / Reliability
    • Important

      Description of problem:

      OpenShift 4.14.44 cluster running IPSEC on OVNKubernetes
      - Certain nodes in the cluster are unable to communicate with one another as expected: packets are dropped (or lost) in transit and are not answered by peers.
      
      - Tested with the IPSEC validation script, which queries between nodes [1], and observed that calls to peer pods fail (openshift-dns pod to openshift-dns pod across hosts; the call is expected to be unobstructed and requires geneve + IPSEC tunnel encapsulation). Calls succeed from all nodes to all nodes but one; on that one impacted node, calls fail to about half of the nodes in the cluster.
      
      - We confirmed that the IC-subnet range for the nodes is valid (100.96.0.0/16, including on the impacted node).
      
      - We restarted the ovnkube-node pods on all hosts; no change.
      
      - As far as we can tell, the IPSEC handshakes look normal; we have nonetheless pulled samples from these host nodes, plus sosreports, for review.
      
      - Need assistance confirming that OVNKube and IPSEC flows are working properly.
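
The node-to-node results described above (all pairs succeed except calls made from one node) can be summarized mechanically. The following is a minimal sketch, not the validation script referenced as [1]: the node names, the shape of the `results` mapping, and the helper `summarize_failures` are all hypothetical and exist only to illustrate how a pass/fail matrix narrows down to a single impacted source node.

```python
from collections import defaultdict


def summarize_failures(results):
    """Group failed calls by source node.

    `results` maps (source_node, target_node) -> True when the
    pod-to-pod call succeeded. This shape is an assumption made
    for illustration only.
    """
    failed_targets = defaultdict(set)
    for (src, dst), ok in results.items():
        if not ok:
            failed_targets[src].add(dst)
    # Only nodes with at least one failed call appear in the summary.
    return {src: sorted(dsts) for src, dsts in sorted(failed_targets.items())}


# Hypothetical four-node cluster: node-d cannot reach half of its peers.
nodes = ["node-a", "node-b", "node-c", "node-d"]
failing = {("node-d", "node-a"), ("node-d", "node-b")}
results = {
    (src, dst): (src, dst) not in failing
    for src in nodes
    for dst in nodes
    if src != dst
}

summary = summarize_failures(results)
print(summary)
```

A summary with a single key, as in this made-up data, matches the pattern reported here: one source node failing toward a subset of peers while every other pairing succeeds.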

       

      Version-Release number of selected component (if applicable):

      4.14.44 

      How reproducible:

      - Replicated on two clusters (customer environments). On one cluster, rebooting the host node appeared to mitigate the behavior for a few days before it returned.
      - The other cluster has been left in the impacted state for diagnostic purposes.

      Steps to Reproduce:

      - Unknown
      - 4.14.44 cluster with manually migrated IC-subnet value (100.96.0.0/16)
      - IPSEC defined at install time
      - vSphere
      - Pod subnet overlap in the 100.80.0.0/12 range required the IC-subnet migration
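
The subnet arithmetic behind the migration can be checked with Python's standard `ipaddress` module. This is a sketch under stated assumptions: it treats 100.80.0.0/12 as the pod range in question, and it uses 100.88.0.0/16 purely as an assumed pre-migration value, since the report does not state what the IC-subnet was before the change. Only 100.80.0.0/12 and 100.96.0.0/16 come from this report.

```python
import ipaddress

# Values taken from this report.
pod_range = ipaddress.ip_network("100.80.0.0/12")
migrated_ic_subnet = ipaddress.ip_network("100.96.0.0/16")

# ASSUMPTION: illustrative pre-migration subnet; not stated in the report.
assumed_previous_ic_subnet = ipaddress.ip_network("100.88.0.0/16")

# A /12 starting at 100.80.0.0 spans 100.80.0.0 through 100.95.255.255.
last_pod_address = pod_range.broadcast_address

overlap_before = pod_range.overlaps(assumed_previous_ic_subnet)
overlap_after = pod_range.overlaps(migrated_ic_subnet)

print(f"pod range ends at {last_pod_address}")
print(f"overlap with assumed previous IC-subnet: {overlap_before}")
print(f"overlap with migrated IC-subnet: {overlap_after}")
```

Because the /12 ends at 100.95.255.255, the migrated 100.96.0.0/16 value sits immediately above the pod range and does not collide with it.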
       

      Actual results:

      Pod-to-pod communication failure between the impacted host and neighboring nodes.

      Expected results:

      Communication should not be blocked on 4.14.44; this release is already beyond the IPSEC and OVNKube handling issues that previously impacted communication, as outlined in:
      
      https://access.redhat.com/solutions/7091399
      https://access.redhat.com/solutions/7088635
      https://access.redhat.com/solutions/7103865

      Additional info:

      - Additional data and uploads in next comments (internal). 
      - Possibly related to: https://issues.redhat.com/browse/OCPBUGS-42616 (?)

              sdn-team-bot sdn-team bot
              rhn-support-wrussell Will Russell
              Anurag Saxena Anurag Saxena