OpenShift Bugs / OCPBUGS-59530

[4.19][upgrade] OCP4.18.X - Stale SNATs/LRPs due to failed sync to add metadata after upgrade


    • Incidents & Support
    • Critical
    • CORENET Sprint 272, CORENET Sprint 273, CORENET Sprint 274
    • Customer Escalated, Customer Facing
    • Done
    • Known Issue
      * Before this update, a cluster upgrade to {product-title} 4.18 caused inconsistent egress IP allocation due to stale Network Address Translation (NAT) handling. This issue occurred only when you deleted an egress IP pod while the OVN-Kubernetes controller for an egress node was down. As a consequence, duplicate Logical Router Policies and egress IP usage occurred, which caused inconsistent traffic flow and outage. With this release, egress IP allocation cleanup ensures consistent and reliable egress IP allocation in {product-title} 4.18 clusters. (link:https://issues.redhat.com/browse/OCPBUGS-59530[OCPBUGS-59530])

      Description of problem:

      • A recently upgraded cluster with multiple EgressIPs and allocated hosts shows duplicate NATs for a given pod, and duplicate hosts with the same EgressIP allocation. The EgressIP output of `oc get eip` lists only one host, but NAT entries on the nodes themselves overlap.
      • Performed an OVNKube DB rebuild on all egress-assignable nodes, following the documented 4.14+ rebuild procedure.
      • After the rebuild, the behavior persists: EgressIP allocation is still duplicated across multiple hosts and egress usage is impacted/unreliable. Pods are NATed to the node IP as well as the EgressIP inconsistently.
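The overlap described above can be checked mechanically. This is a minimal sketch (not part of the report), assuming per-node dumps of `ovn-nbctl lr-nat-list GR_<node>` in the usual TYPE / EXTERNAL_IP / LOGICAL_IP column layout; the node names and addresses below are fabricated for illustration:

```python
# Sketch only: flag egress IPs that appear as SNAT external IPs on more
# than one node's gateway router, given per-node `lr-nat-list` output.
from collections import defaultdict

def find_duplicate_snats(dumps):
    """Map each SNAT external IP to the nodes advertising it, keeping
    only IPs claimed by more than one node (the stale/duplicate case)."""
    nodes_by_ip = defaultdict(set)
    for node, text in dumps.items():
        for line in text.splitlines():
            parts = line.split()
            if len(parts) >= 2 and parts[0] == "snat":
                nodes_by_ip[parts[1]].add(node)
    return {ip: sorted(n) for ip, n in nodes_by_ip.items() if len(n) > 1}

# Fabricated sample dumps for two nodes.
dumps = {
    "worker-0": "snat  192.0.2.10  10.128.2.14\nsnat  192.0.2.10  10.128.2.15\n",
    "worker-1": "snat  192.0.2.10  10.131.0.7\nsnat  192.0.2.11  10.131.0.9\n",
}
print(find_duplicate_snats(dumps))  # → {'192.0.2.10': ['worker-0', 'worker-1']}
```

Any external IP reported for more than one node corresponds to the duplicate SNAT condition described in this report.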

      Version-Release number of selected component (if applicable):

      • OpenShift 4.18.15

      How reproducible:

      • Internal replication is in progress; currently observed on one production cluster in a customer environment.

      Steps to Reproduce:

      1. Deploy a cluster with multiple EgressIPs and multiple egress-capable host nodes

      2. Upgrade to 4.18.15

      3. Observe stale NAT handling (documented further in https://issues.redhat.com/browse/OCPBUGS-50709).

      4. Perform the workaround outlined in KCS https://access.redhat.com/solutions/7110252.

      5. Observe that after the DB rebuilds on the nodes the behavior is NOT resolved and the issue persists: EgressIP allocation may have migrated to new hosts, but stale entries are still present on multiple endpoints and traffic flow is blocked.
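The duplicate Logical Router Policies involved in this condition can be spotted in `ovn-nbctl lr-policy-list` output. A minimal sketch (not part of the report), assuming lines shaped as `PRIORITY MATCH reroute NEXTHOP`; the priorities, matches, and nexthops below are illustrative only:

```python
# Sketch only: group reroute policies by (priority, match); a pod source
# address carrying more than one nexthop indicates a stale duplicate LRP.
from collections import defaultdict

def find_duplicate_reroutes(policy_text):
    nexthops = defaultdict(set)
    for line in policy_text.splitlines():
        parts = line.split()
        if "reroute" in parts:
            i = parts.index("reroute")
            key = (parts[0], " ".join(parts[1:i]))  # (priority, match)
            nexthops[key].update(parts[i + 1:])
    return {k: sorted(v) for k, v in nexthops.items() if len(v) > 1}

# Fabricated sample: the first pod IP is rerouted to two different nexthops.
sample = """\
102 ip4.src == 10.128.2.14 reroute 100.64.0.3
102 ip4.src == 10.128.2.14 reroute 100.64.0.5
102 ip4.src == 10.131.0.7 reroute 100.64.0.5
"""
print(find_duplicate_reroutes(sample))
# → {('102', 'ip4.src == 10.128.2.14'): ['100.64.0.3', '100.64.0.5']}
```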

      Actual results:

      • EgressIP allocation is incorrect and unreliable

      Expected results:

      • EgressIP should be handled consistently (pending remap of EgressIP to its own handler), but more importantly, the cleanup tasks that rectify this condition should work consistently.

      Additional info:

      • Given that the known workaround of a DB rebuild does NOT appear to be successful and egress IP handling is still impacted, filing this as a new bug rather than linking to the previous bug/epic, because the behavior currently cannot be resolved. Engineering assistance is needed to identify why the DB rebuild fails to resolve the EgressIP allocation behavior.
      • I am working on better tooling to improve detection of this behavior, and on possible methods for manual cleanup (TBD), but will need engineering assistance to prioritize this, which may involve escalating the epic for egress-handler creation if no other workarounds can be identified.
      • The customer is blocked from upgrading their clusters, and one cluster is currently impacted with an ongoing outage.

       

      KCS created: https://access.redhat.com/solutions/7125049

              Martin Kennelly
              Will Russell
              Jean Chen
              Surya Seetharaman