Bug
Resolution: Done-Errata
Critical
4.14.z
Important
Release Note Not Required
In Progress
Customer Escalated
Description of problem:
Bare-metal UPI cluster nodes lose communication with other nodes, which in turn affects pod communication on those nodes. The issue can be fixed with an OVN database rebuild on the affected nodes, but the nodes eventually degrade and lose communication again. Note that although an OVN rebuild fixes the issue temporarily, host networking is set to true, so the kernel routing table is in use. Update: also observed on vSphere with routingViaHost: false and ipForwarding: global.
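For reference, the routingViaHost and ipForwarding settings mentioned above live in the cluster Network operator configuration under the OVN-Kubernetes gatewayConfig. A minimal sketch of how to confirm what a given cluster is using (output format varies by version):

    # Show the OVN-Kubernetes gateway configuration, including routingViaHost and ipForwarding
    oc get network.operator.openshift.io cluster \
      -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig}{"\n"}'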
Version-Release number of selected component (if applicable):
4.14.7, 4.14.30
How reproducible:
Can't reproduce locally, but reproducible and repeatedly occurring in the customer environment.
Steps to Reproduce:
1. Identify a host node whose pods can't be reached from other hosts in default namespaces (tested via openshift-dns).
2. Observe that curls to that peer pod consistently time out.
3. Take tcpdumps toward the target pod: packets arrive and are acknowledged, but the replies never route back to the client pod (the SYN/ACK is seen at the pod network layer but not at the geneve interface, so it is dropped before reaching the geneve tunnel). A rough command sketch follows below.
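A minimal sketch of that check, assuming a pair of openshift-dns pods and a test listener on port 7777 as implied by the captures in Additional info; pod names, IPs, and interface details are placeholders, not values from the case:

    # Locate a DNS pod on the suspect node and a peer DNS pod on another node
    oc get pods -n openshift-dns -o wide
    # From a pod on another host, curl the pod on the suspect node; on affected nodes this times out
    oc -n openshift-dns exec <client-dns-pod> -- curl -v --max-time 5 http://<target-pod-ip>:7777/
    # On the suspect node, packets arrive and a SYN/ACK is generated at the pod network layer...
    tcpdump -i any -nn host <client-pod-ip> and tcp port 7777
    # ...but the reply never shows up on the geneve interface, i.e. it is dropped before the tunnel
    tcpdump -i genev_sys_6081 -nn host <client-pod-ip> and tcp port 7777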
Actual results:
Nodes repeatedly degrade and lose communication despite fixing the issue with an OVN DB rebuild (a rebuild only provides hours or days of respite, not a permanent resolution).
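For context, the temporary workaround referenced above is a per-node OVN database rebuild. A minimal sketch, assuming the 4.14 OVN interconnect layout where each node keeps local NB/SB databases under /var/lib/ovn-ic/etc; paths and exact steps can differ between releases, so treat this as illustrative only:

    # On the affected node, remove the local OVN databases so they get regenerated
    rm -f /var/lib/ovn-ic/etc/ovnnb_db.db /var/lib/ovn-ic/etc/ovnsb_db.db
    # Restart that node's ovnkube-node pod to trigger the rebuild
    oc -n openshift-ovn-kubernetes delete pod -l app=ovnkube-node \
      --field-selector spec.nodeName=<node-name>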
Expected results:
Nodes should not lose communication, and even if they did, it should not happen repeatedly.
Additional info:
What's been tried so far:
- Multiple OVN rebuilds on different nodes (works, but the node eventually hits the issue again)
- Flushing conntrack (doesn't work)
- Restarting nodes (doesn't work)

Data gathered (see the command sketch after this list):
- tcpdump from all interfaces for the DNS pods going to port 7777 (to segregate traffic)
- ovnkube-trace
- sosreports of two nodes having communication issues before an OVN rebuild
- sosreports of two nodes having communication issues after an OVN rebuild
- OVS trace dumps of br-int and br-ex

More data in nested comments below.
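For completeness, the data-gathering items above correspond roughly to commands like the following; pod names, addresses, and port numbers are placeholders, and the exact invocations used in the case may have differed:

    # Capture the DNS-pod test traffic on port 7777 from all interfaces on the node
    tcpdump -i any -nn -w /tmp/port7777.pcap tcp port 7777
    # Simulate the pod-to-pod path through OVN between the two DNS pods
    ovnkube-trace -src-namespace openshift-dns -src <client-dns-pod> \
      -dst-namespace openshift-dns -dst <target-dns-pod> -tcp -dst-port 7777
    # Trace how OVS forwards the flow on the integration bridge (pod traffic)
    ovs-appctl ofproto/trace br-int 'in_port=<ofport>,tcp,nw_src=<client-pod-ip>,nw_dst=<target-pod-ip>,tp_dst=7777'
    # And on the external bridge, where the traffic is geneve-encapsulated (UDP 6081)
    ovs-appctl ofproto/trace br-ex 'in_port=<ofport>,udp,nw_src=<node-ip>,nw_dst=<peer-node-ip>,tp_dst=6081'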
blocks:
- OCPBUGS-42780 Nodes to Node and subsequently pod to pod communication are repeatedly degrading despite multiple OVN DB rebuilds to fix the issue (Closed)

depends on:
- OCPBUGS-42402 Nodes to Node and subsequently pod to pod communication are repeatedly degrading despite multiple OVN DB rebuilds to fix the issue (Release Pending)

is blocked by:
- OCPBUGS-36404 Too many pending CSRs lead to scaleup failures when scaling to 500 nodes (Verified)

is cloned by:
- OCPBUGS-42780 Nodes to Node and subsequently pod to pod communication are repeatedly degrading despite multiple OVN DB rebuilds to fix the issue (Closed)
- OCPBUGS-42952 [4.14 IPSEC] pod to pod communication is degraded (Closed)

relates to:
- FDP-846 ovs-monitor-ipsec can't proceed if 'ipsec auto' process is stuck (Closed)

links to:
- RHBA-2024:7944 OpenShift Container Platform 4.16.z bug fix update