Bug
Resolution: Done-Errata
Critical
4.18.z
Description of problem:
- A recently upgraded cluster with multiple EgressIPs and egress-assignable hosts is showing duplicate NATs for a given pod, and the same EgressIP allocated on more than one host. The EgressIP output (`oc get eip`) lists only one assigned host, but the NAT entries on the nodes themselves show overlapping allocations (see the diagnostic sketch after this section).
- Performed the OVNKube DB rebuild on all egress-assignable nodes, following the documented 4.14+ DB rebuild procedure.
- After the rebuild the behavior persists: EgressIP allocation is still duplicated across multiple hosts, and egress usage is impacted/unreliable. Pods are inconsistently NATed to the node IP as well as to the EgressIP.
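A minimal diagnostic sketch of the comparison described above, assuming the 4.14+ interconnect layout where each ovnkube-node pod carries an `nbdb` container; the pod and node names are placeholders:

```shell
# Reported assignment: the EgressIP object lists a single assigned node
oc get egressip

# Actual SNAT entries programmed on a given node's gateway router (GR_<node-name>),
# listed from the nbdb container of that node's ovnkube-node pod
oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c nbdb -- \
  ovn-nbctl lr-nat-list GR_<node-name>
```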
Version-Release number of selected component (if applicable):
- OpenShift 4.18.15
How reproducible:
- Working on internal replication presently; currently observed on one production cluster in a customer environment.
Steps to Reproduce:
1. Deploy a cluster with multiple EgressIPs and multiple egress-capable host nodes
2. Upgrade to 4.18.15
3. Observe stale NAT handling (issue documented further in https://issues.redhat.com/browse/OCPBUGS-50709).
4. Perform the workaround outlined in KCS https://access.redhat.com/solutions/7110252.
5. Observe that after the DB rebuilds on the nodes the behavior is NOT resolved and the issue persists: EgressIP allocation may have migrated to new hosts, but stale entries are still present on multiple endpoints and traffic flow is blocked (see the verification sketch after these steps).
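A hedged verification sketch for step 5, assuming the affected EgressIP address is known and that the ovnkube-node pods carry the `app=ovnkube-node` label and an `nbdb` container; it reports every gateway router that still carries an SNAT mentioning that address (a healthy cluster should report at most one node):

```shell
EIP=192.0.2.10   # placeholder: the EgressIP address under investigation
for POD in $(oc -n openshift-ovn-kubernetes get pods -l app=ovnkube-node -o name); do
  NODE=$(oc -n openshift-ovn-kubernetes get "$POD" -o jsonpath='{.spec.nodeName}')
  # List SNATs on this node's gateway router and keep only lines mentioning the EgressIP
  MATCHES=$(oc -n openshift-ovn-kubernetes exec "$POD" -c nbdb -- \
    ovn-nbctl lr-nat-list "GR_${NODE}" 2>/dev/null | grep "$EIP")
  [ -n "$MATCHES" ] && echo "== $NODE ==" && echo "$MATCHES"
done
```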
Actual results:
- EgressIP is not allocated correctly; egress traffic handling is unreliable
Expected results:
- EgressIP should be handled consistently (pending the planned remap of EgressIP to its own handler); but more importantly, the cleanup tasking to rectify the condition should work consistently.
Additional info:
- Given that the known workaround of a DB rebuild does NOT appear to succeed and EgressIP handling is still impacted, this is filed as a new bug rather than linked to the previous bug/epic, because the behavior currently cannot be resolved. Engineering assistance is needed to identify why the DB rebuild fails to resolve the EgressIP allocation behavior.
- I am working on better tooling to improve our detection of this behavior, and on possible methods for manual cleanup (TBD; a candidate is sketched at the end of this section), but will need assistance from engineering on prioritizing this, which may involve escalating the epic for EgressIP handler creation if no other workaround can be identified.
- Customer is blocked from upgrading their clusters, and one cluster is currently impacted with an ongoing outage.
KCS created: https://access.redhat.com/solutions/7125049
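As a sketch of the manual cleanup idea mentioned above (an unvalidated assumption on my part, not a supported procedure), a stale SNAT found on a gateway router that should no longer host the EgressIP could in principle be removed with `ovn-nbctl lr-nat-del`; the pod, node, and IP below are placeholders:

```shell
# Candidate manual cleanup (UNVALIDATED): delete the stale SNAT entry from the
# gateway router of the node that should no longer host the EgressIP.
# For snat rules, the IP argument to lr-nat-del is the logical (pod) IP of the entry.
oc -n openshift-ovn-kubernetes exec <ovnkube-node-pod> -c nbdb -- \
  ovn-nbctl lr-nat-del GR_<stale-node-name> snat <pod-logical-ip>
```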
- clones: OCPBUGS-57179 [4.20][upgrade] OCP4.18.X - Stale SNATs/LRPs due to failed sync to add metadata after upgrade (Verified)
- depends on: OCPBUGS-57179 [4.20][upgrade] OCP4.18.X - Stale SNATs/LRPs due to failed sync to add metadata after upgrade (Verified)
- is cloned by: OCPBUGS-59531 [4.18][upgrade] OCP4.18.X - Stale SNATs/LRPs due to failed sync to add metadata after upgrade (Closed)
- is depended on by:
  - OCPBUGS-57179 [4.20][upgrade] OCP4.18.X - Stale SNATs/LRPs due to failed sync to add metadata after upgrade (Verified)
  - OCPBUGS-59531 [4.18][upgrade] OCP4.18.X - Stale SNATs/LRPs due to failed sync to add metadata after upgrade (Closed)
- links to: RHBA-2025:12341 OpenShift Container Platform 4.19.7 bug fix update