Uploaded image for project: 'OpenShift Bugs'
  1. OpenShift Bugs
  2. OCPBUGS-74136

IPsec state desynchronization after node reboot causes pod-to-pod connectivity loss and upgrade stall

XMLWordPrintable

    • None
    • False
    • Hide

      None

      Show
      None
    • None
    • Moderate
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      Description of problem:

      During the cluster update from 4.18.30 to 4.19.21, pod-to-pod and node-to-node connectivity is lost following node reboots triggered by the MachineConfig operator. Although nodes return to the Ready state and pods remain running, encrypted traffic is dropped.  

      AuthServerRouteEndpointAccessibleControllerAvailable:
      Get "https://oauth.apps.<cluster-domain>/healthz": EOF 

      It looks like the root cause is a state desynchronization in the IPsec stack. When a node reboots, it re-initializes its IPsec Security Associations (SAs) with new Security Parameters Indexes (SPIs). However, peer nodes fail to synchronize immediately and continue attempting to send traffic using stale SPIs. This mismatch causes the kernel to drop encrypted packets, leading to EOF errors on authentication routes and stalling the upgrade.

      Restarting ovn-ipsec-host and ovnkube-node resolves the issue. 

      Version-Release number of selected component (if applicable): OpenShift 4.18.30 -> 4.19.21

      How reproducible:

      Try to upgrade the cluster from 4.18.30 to 4.19.21 while having IPSec activated. 

      Steps to Reproduce:

      • Enable IPsec encryption on an OVN-Kubernetes cluster.
      • Initiate a cluster upgrade or trigger a MachineConfig change that requires a rolling reboot of nodes.
      • Monitor connectivity to the OAuth or Console routes during the reboot cycle.

      Actual results:

      • Nodes reboot and return to Ready status.
      • IPsec xfrm state shows mismatched SPIs between the rebooted node and its peers.
      • Encrypted traffic is dropped; Ingress and OAuth routes become unreachable.
      • Error reported: AuthServerRouteEndpointAccessibleControllerAvailable: Get "https://oauth.apps.<domain>/healthz": EOF{{ }}
      • The upgrade hangs because Cluster Operators cannot verify health over the network.

       

      Additional info:

      The issue was detected on a VMWare UPI cluster. 

       

              bbennett@redhat.com Ben Bennett
              rh-ee-sstumpf Simon Stumpf
              None
              None
              Anurag Saxena Anurag Saxena
              None
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Created:
                Updated: