OpenShift Bugs / OCPBUGS-59234

[4.19] OCP 4.18.15 - EgressIP appears to remove and re-allocate endpoints periodically leading to packet loss (no egressIP migration between hosts observed)


    • Incidents & Support
    • Critical
    • CORENET Sprint 272, CORENET Sprint 273, CORENET Sprint 274
    • Done
    • Bug Fix
      * Before this update, intermittent egress internet protocol (IP) handling due to inconsistent state updates in `OVNkubernetes` caused packet drops. These packet drops affected network traffic flow. With this release, `OVNkubernetes` pods consistently use their assigned egress IPs. As a result, dropped packets are reduced and network traffic flow is improved. (link:https://issues.redhat.com/browse/OCPBUGS-59234[OCPBUGS-59234])

      Description of problem:

      • Observed the following behavior in the logs for a given ovnkube-node host running OVNkubernetes:
        • Running pods briefly stop using their assigned Egress IP for very short time intervals.
          Example:
          The firewall logs show the following dropped packets:
          2025-06-11 09:06:05    139.23.166.13  139.23.81.244    1521  drop
          2025-06-11 09:06:05    139.23.166.13  139.23.81.244    1521  drop
          
          At the same time, the following log entries are produced by the ovnkube-controller container: [...]
          
          Node (demchdc253x): I0611 09:06:05.802359 2103837 egressip.go:869] Adding pod egress IP status: {demchdc255x 139.23.166.51} for EgressIP: egress-di-pp and pod: di-pp/bisrte-bisout-745898fc95-mncvv/[10.195.163.47/24]
          Node (demchdc255x): I0611 09:06:05.800283 4127154 egressip.go:869] Adding pod egress IP status: {demchdc255x 139.23.166.51} for EgressIP: egress-di-pp and pod: di-pp/bisrte-bisout-745898fc95-mncvv/[10.195.163.47/32]
          Node (demchdc253x): I0611 09:06:05.800775 2103837 egressip.go:1042] Deleting pod egress IP status: {demchdc255x 139.23.166.51} for EgressIP: egress-di-pp and pod: bisrte-bisout-745898fc95-mncvv/di-pp
          Node (demchdc255x): I0611 09:06:05.799092 4127154 egressip.go:1042] Deleting pod egress IP status: {demchdc255x 139.23.166.51} for EgressIP: egress-di-pp and pod: bisrte-bisout-745898fc95-mncvv/di-pp
          [...]
          
          After this interlude, traffic continues to use the correct Egress IP. I found many instances where this happens only once for a pod, but also some instances where it happens multiple times during the runtime of a pod (see the log-scanning sketch after this list).
      • The behavior is observed on all ovnkube-node pods and across multiple EgressIPs.
      • No external sync issue with Argo CD (which manages the egress objects) was observed: Argo CD logs and access logs show no actions taken on the objects. The behavior seems very intermittent and unpredictable.
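      To quantify the churn, the following is a minimal log-scanning sketch (my illustration, not part of the original report). It parses the klog-style "Adding/Deleting pod egress IP status" lines quoted above from stdin and counts delete-then-add flaps per EgressIP/pod pair; the regex, the 5-second flap window, and the pod-key normalization are assumptions derived from the excerpt.

      #!/usr/bin/env python3
      # Minimal sketch: count "Deleting -> Adding" egress IP flaps per
      # EgressIP/pod pair in ovnkube-controller logs read from stdin.
      import re
      import sys
      from collections import defaultdict
      from datetime import datetime

      # klog prefix (MMDD HH:MM:SS.micros + PID), then the messages quoted
      # above; search() tolerates the "Node (...):" prefix in the excerpt.
      LINE = re.compile(
          r"[IWE](\d{4} \d{2}:\d{2}:\d{2}\.\d+)\s+\d+\s+egressip\.go:\d+\]\s+"
          r"(Adding|Deleting) pod egress IP status: \{[^}]*\} "
          r"for EgressIP: (\S+) and pod: (\S+)"
      )

      def pod_key(raw: str) -> str:
          # "Adding" lines log ns/pod/[cidr] while "Deleting" lines log
          # pod/ns, so normalize both to a sorted, CIDR-free key.
          parts = [p for p in raw.split("/")
                   if p and not p.startswith("[") and not p.endswith("]")]
          return "/".join(sorted(parts))

      events = defaultdict(list)  # (egressip, pod) -> [(time, action), ...]
      for line in sys.stdin:
          m = LINE.search(line)
          if not m:
              continue
          ts, action, eip, pod = m.groups()
          # klog omits the year; assume the capture stays within one year.
          events[(eip, pod_key(pod))].append(
              (datetime.strptime(ts, "%m%d %H:%M:%S.%f"), action))

      for (eip, pod), evs in sorted(events.items()):
          evs.sort()
          flaps = sum(1 for (t1, a1), (t2, a2) in zip(evs, evs[1:])
                      if a1 == "Deleting" and a2 == "Adding"
                      and (t2 - t1).total_seconds() < 5)  # assumed window
          if flaps:
              print(f"{eip} / {pod}: {flaps} delete->add flap(s) "
                    f"across {len(evs)} events")

      Feed it the controller logs, for example: oc logs -n openshift-ovn-kubernetes <ovnkube-node-pod> -c ovnkube-controller | python3 eip_flaps.py (the pod name is a placeholder and eip_flaps.py is a hypothetical file name).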

      Version-Release number of selected component (if applicable):

      • OCP 4.18.15

      How reproducible:

      • Working on an internal reproducer now; it is unclear how easy this is to replicate.

      Steps to Reproduce:

      1. Deploy a cluster on 4.18.15 and apply multiple EgressIPs across multiple namespaces.

      2. Observe pods periodically cycling the EgressIP state in the logs and dropping packets as a result.

      3. Observe that the EgressIPs are NOT moved to new nodes and remain consistently available (a status-watcher sketch follows below).
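      In support of step 3, the sketch below (my illustration, not from the report) polls the cluster-scoped EgressIP objects (group k8s.ovn.org, version v1) with the kubernetes Python client and reports whenever an egress IP's assigned node changes; kubeconfig-based access and the 10-second poll interval are assumptions.

      #!/usr/bin/env python3
      # Minimal sketch: watch EgressIP status and report node reassignment.
      import time

      from kubernetes import client, config  # pip install kubernetes

      config.load_kube_config()  # assumes kubeconfig-based cluster access
      api = client.CustomObjectsApi()

      last = {}  # egress IP address -> node it was last assigned to
      while True:
          eips = api.list_cluster_custom_object(
              group="k8s.ovn.org", version="v1", plural="egressips")
          for eip in eips.get("items", []):
              name = eip["metadata"]["name"]
              # status.items carries the (egressIP, node) assignments.
              for item in eip.get("status", {}).get("items", []):
                  ip, node = item["egressIP"], item["node"]
                  if last.get(ip, node) != node:
                      print(f"{name}: {ip} moved {last[ip]} -> {node}")
                  last[ip] = node
          time.sleep(10)  # assumed poll interval

      If the bug behaves as described, this watcher should stay silent even while the log-scanning sketch above reports flaps, confirming that the churn is local state cycling rather than egress IP migration between hosts.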

      Actual results:

      • EgressIP handling is intermittently unavailable, causing brief packet drops for affected pods.

      Expected results:

      • Consistent EgressIP handling is expected: pod traffic should always leave with the assigned egress IP.

      Additional info:

      • This cluster was recently inspected and repaired using the steps outlined in https://access.redhat.com/solutions/7125049 and https://issues.redhat.com/browse/OCPBUGS-57179. The OVNkube DBs have been rebuilt, and we are seeing this behavior after that process, so these are clean, newly created EgressIPs that the cluster is flapping.
      • The customer is using IPsec on these nodes, which is possibly a factor (a quick configuration check is sketched after this list).
      • Data is available in the first comment below.
      • Proactively tagging mkennell@redhat.com since this is the same customer/cluster, but filing a separate bug to ensure it is not conflated with the previous issue.
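      To pin down the IPsec angle, a minimal sketch (my illustration; the field path into the Network operator config is an assumption) reads the cluster Network operator object and prints its ipsecConfig stanza:

      #!/usr/bin/env python3
      # Minimal sketch: check whether IPsec is enabled on the default
      # OVN network, to correlate with the observed flapping.
      from kubernetes import client, config

      config.load_kube_config()
      api = client.CustomObjectsApi()

      net = api.get_cluster_custom_object(
          group="operator.openshift.io", version="v1",
          plural="networks", name="cluster")
      ovn = net["spec"]["defaultNetwork"].get("ovnKubernetesConfig", {})
      # An ipsecConfig stanza (even an empty one) means IPsec is enabled;
      # newer releases also expose a "mode" field. Field path is assumed.
      print("ipsecConfig:", ovn.get("ipsecConfig", "not set"))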

       

              Assignee: Martin Kennelly (mkennell@redhat.com)
              Reporter: Will Russell (rhn-support-wrussell)
              QA Contact: Huiran Wang