OpenShift Bugs / OCPBUGS-57433

[4.20] OCP 4.18.15 - EgressIP appears to remove and re-allocate endpoints periodically leading to packet loss (no egressIP migration between hosts observed)


    • Incidents & Support
    • False
    • None
    • Critical
    • None
    • None
    • CORENET Sprint 272, CORENET Sprint 273
    • 2
    • Done
    • Bug Fix
      * Before this update, the Ingress Operator added resources, most notably gateway resources, to the `status.relatedObjects` parameter of the Cluster Operator even if the CRDs for those resources did not exist. Additionally, the Ingress Operator specified a namespace for the `istios` and `GatewayClass` resources, which are both cluster-scoped resources. As a result of these configurations, the `relatedObjects` parameter contained misleading information. With this release, an update to the status controller of the Ingress Operator ensures that the controller checks whether these resources already exist and also checks the related feature gates before adding any of these resources to the `relatedObjects` parameter. The controller no longer specifies namespaces for the `GatewayClass` and `istio` resources. This update ensures that the `relatedObjects` parameter contains accurate information for the `GatewayClass` and `istio` resources. (link:https://issues.redhat.com/browse/OCPBUGS-57433[OCPBUGS-57433])

      Description of problem:

      • The following behavior was observed in the logs of an ovnkube-node host running OVN-Kubernetes:
        • Running pods appear to stop using their assigned egress IP for very short intervals.
          Example:
          The firewall logs show the following dropped packets:
          2025-06-11 09:06:05    139.23.166.13  139.23.81.244    1521  drop
          2025-06-11 09:06:05    139.23.166.13  139.23.81.244    1521  drop
          
          At the same time, the ovnkube-controller container produces the following log entries: [...]
          
          Node (demchdc253x): I0611 09:06:05.802359 2103837 egressip.go:869] Adding pod egress IP status: {demchdc255x 139.23.166.51} for EgressIP: egress-di-pp and pod: di-pp/bisrte-bisout-745898fc95-mncvv/[10.195.163.47/24]
          Node (demchdc255x): I0611 09:06:05.800283 4127154 egressip.go:869] Adding pod egress IP status: {demchdc255x 139.23.166.51} for EgressIP: egress-di-pp and pod: di-pp/bisrte-bisout-745898fc95-mncvv/[10.195.163.47/32]
          Node (demchdc253x): I0611 09:06:05.800775 2103837 egressip.go:1042] Deleting pod egress IP status: {demchdc255x 139.23.166.51} for EgressIP: egress-di-pp and pod: bisrte-bisout-745898fc95-mncvv/di-pp
          Node (demchdc255x): I0611 09:06:05.799092 4127154 egressip.go:1042] Deleting pod egress IP status: {demchdc255x 139.23.166.51} for EgressIP: egress-di-pp and pod: bisrte-bisout-745898fc95-mncvv/di-pp
          [...]
          
          After this interlude, traffic continues to use the correct egress IP. In many instances this happens only once for a pod, but in some instances it happens multiple times during the lifetime of a pod.
      • The behavior is observed on all ovnkube-node pods and across multiple EgressIPs.
      • No external sync issue with Argo CD (which manages the egress objects) was observed: the Argo CD logs and access logs show no actions taken on the objects. The behavior seems very intermittent and unpredictable.
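The delete and re-add pairs in the log excerpt above can be extracted mechanically. The following is a minimal sketch, not part of any OpenShift tooling. It assumes log lines shaped exactly like the four entries quoted in this report, including the `Node (<host>):` prefix added when the logs were collected. The function names `parse_event` and `find_flaps` are hypothetical, and the default year of 2025 is taken from the firewall log dates because the ovnkube-controller timestamps carry no year.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Matches the "Adding"/"Deleting pod egress IP status" lines quoted in this report.
LINE_RE = re.compile(
    r"(?:Node \((?P<src>[^)]+)\): )?"
    r"I(?P<md>\d{4}) (?P<clock>\d{2}:\d{2}:\d{2}\.\d{6}) \d+ egressip\.go:\d+\] "
    r"(?P<action>Adding|Deleting) pod egress IP status: "
    r"\{(?P<node>\S+) (?P<ip>[^}\s]+)\} for EgressIP: (?P<eip>\S+) and pod: (?P<pod>\S+)"
)

SAMPLE = [
    "Node (demchdc253x): I0611 09:06:05.802359 2103837 egressip.go:869] Adding pod egress IP status: {demchdc255x 139.23.166.51} for EgressIP: egress-di-pp and pod: di-pp/bisrte-bisout-745898fc95-mncvv/[10.195.163.47/24]",
    "Node (demchdc255x): I0611 09:06:05.800283 4127154 egressip.go:869] Adding pod egress IP status: {demchdc255x 139.23.166.51} for EgressIP: egress-di-pp and pod: di-pp/bisrte-bisout-745898fc95-mncvv/[10.195.163.47/32]",
    "Node (demchdc253x): I0611 09:06:05.800775 2103837 egressip.go:1042] Deleting pod egress IP status: {demchdc255x 139.23.166.51} for EgressIP: egress-di-pp and pod: bisrte-bisout-745898fc95-mncvv/di-pp",
    "Node (demchdc255x): I0611 09:06:05.799092 4127154 egressip.go:1042] Deleting pod egress IP status: {demchdc255x 139.23.166.51} for EgressIP: egress-di-pp and pod: bisrte-bisout-745898fc95-mncvv/di-pp",
]


def parse_event(line, year=2025):
    """Return a dict for one EgressIP status log line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    when = datetime.strptime(f"{year}{m['md']} {m['clock']}", "%Y%m%d %H:%M:%S.%f")
    if m["action"] == "Adding":
        # "Adding" lines print namespace/name/[ip]
        namespace, name = m["pod"].split("/")[:2]
    else:
        # "Deleting" lines print name/namespace
        name, namespace = m["pod"].split("/", 1)
    return {
        "src": m["src"], "when": when, "action": m["action"],
        "egressip": m["eip"], "namespace": namespace, "pod": name,
    }


def find_flaps(lines, year=2025):
    """Pair each Deleting event with the next Adding event for the same pod on the same logging node."""
    by_key = defaultdict(list)
    for line in lines:
        ev = parse_event(line, year)
        if ev is not None:
            by_key[(ev["src"], ev["egressip"], ev["namespace"], ev["pod"])].append(ev)
    flaps = []
    for key, events in by_key.items():
        events.sort(key=lambda e: e["when"])
        for prev, cur in zip(events, events[1:]):
            if prev["action"] == "Deleting" and cur["action"] == "Adding":
                gap_us = (cur["when"] - prev["when"]) // timedelta(microseconds=1)
                flaps.append((key, prev["when"], gap_us))
    return flaps


if __name__ == "__main__":
    for key, deleted_at, gap_us in find_flaps(SAMPLE):
        print(key, deleted_at.isoformat(), f"{gap_us} us")
```

Grouping by the logging node as well as the pod keeps each node's view separate, since both nodes in the excerpt log their own delete and add for the same pod.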

      Version-Release number of selected component (if applicable):

      How reproducible:

      • An internal reproducer is in progress; it is unclear how easy the issue is to reproduce.

      Steps to Reproduce:

      1. Deploy a cluster on 4.18.15 and apply multiple EgressIPs across multiple namespaces.

      2. Observe pods periodically cycling their EgressIP state in the logs and dropping packets as a result.

      3. Observe that the EgressIPs are NOT moved to new nodes and remain consistently available.
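To check whether dropped packets line up with EgressIP status changes, firewall records can be compared against the times of the "Deleting pod egress IP status" log entries. The following is a rough sketch under stated assumptions. The six-column layout (date, time, source, destination, port, action) is inferred from the two firewall lines quoted in the description, whose columns are not labeled in the report. The function names `parse_drop` and `drops_near_flaps` and the one-second tolerance are hypothetical choices, the latter picked because the firewall timestamps only have one-second resolution.

```python
from datetime import datetime, timedelta


def parse_drop(line):
    """Parse one firewall line; assumed columns: date, time, source, destination, port, action."""
    parts = line.split()
    if len(parts) != 6 or parts[5] != "drop":
        return None
    when = datetime.strptime(f"{parts[0]} {parts[1]}", "%Y-%m-%d %H:%M:%S")
    return {"when": when, "src": parts[2], "dst": parts[3], "port": int(parts[4])}


def drops_near_flaps(firewall_lines, flap_times, tolerance=timedelta(seconds=1)):
    """Return the drop records whose timestamp is within `tolerance` of any flap time."""
    matched = []
    for line in firewall_lines:
        drop = parse_drop(line)
        if drop is None:
            continue
        if any(abs(t - drop["when"]) <= tolerance for t in flap_times):
            matched.append(drop)
    return matched


if __name__ == "__main__":
    firewall = [
        "2025-06-11 09:06:05    139.23.166.13  139.23.81.244    1521  drop",
        "2025-06-11 09:06:05    139.23.166.13  139.23.81.244    1521  drop",
    ]
    # Time of the earliest "Deleting" entry quoted in the description.
    flap_times = [datetime(2025, 6, 11, 9, 6, 5, 799092)]
    for drop in drops_near_flaps(firewall, flap_times):
        print(drop)
```

A match only shows that the two events are close in time. It does not by itself establish that the status change caused the drop.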

      Actual results:

      • EgressIP handling is intermittently unavailable.

      Expected results:

      • EgressIP handling is expected to be consistently available.

      Additional info:

      • This cluster was recently reviewed and fixed using the steps outlined in https://access.redhat.com/solutions/7125049 and https://issues.redhat.com/browse/OCPBUGS-57179. The ovnkube databases have been rebuilt, and this behavior is seen after that process, so the cluster is flapping clean, "new" EgressIPs.
      • The customer uses IPsec on these nodes; this is possibly a factor.
      • Data is available in the first comment below.
      • Proactively tagging mkennell@redhat.com because this is the same customer and cluster, but filing a separate bug to ensure it is not conflated with the previous issue.

       

              mkennell@redhat.com Martin Kennelly
              rhn-support-wrussell Will Russell
              None
              Martin Kennelly
              Huiran Wang Huiran Wang
              None