Bug
Resolution: Done-Errata
Major
4.14
Description of problem:
I opened this ticket to request a backport to 4.14, because we have a critical account that is hitting the same issue and cannot upgrade past 4.14 in the near future.
This ticket is also based on https://issues.redhat.com/browse/OCPBUGS-44493, which holds the main information about the missing LRPs (logical router policies) observed in the OVN DBs of the nodes that have egress IPs assigned. This new ticket exists to track that issue separately.
Description taken from the first bug opened:
- Another problem, and the most serious one because it has caused many disruptions to the customer's connections, is that this implementation occasionally leaves OVN with inconsistent LRPs in the OVN DBs of the nodes that have egress IPs assigned. One example can be seen below:
$ for OVNNODE in ovnkube-node-8wlfp ovnkube-node-bsqb7 ovnkube-node-dqjl4 ovnkube-node-j8cd9; \
do echo "-----------------------------------------" ; \
echo "LRP on $OVNNODE" ; \
echo "------------------------------------------" ; \
oc -n openshift-ovn-kubernetes exec -c northd $OVNNODE -- ovn-nbctl find logical_router_policy external_ids='{"name"="egress-agnhost-websrv"}' ; \
sleep 1; done
-------------------------------------------
LRP on ovnkube-node-8wlfp
-------------------------------------------
_uuid : 1506ff82-77ea-420a-bd8d-28bde56cfd33
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.14.28"
nexthop : []
nexthops : ["10.192.10.2"]
options : {}
priority : 100
_uuid : 67dce56c-c5ee-47a3-937c-5092346dab31
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.14.18"
nexthop : []
nexthops : ["10.192.10.2"]
options : {}
priority : 100
_uuid : 040cf3b9-bdf3-456e-9dee-24b0cdc62af4
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.12.24"
nexthop : []
nexthops : ["10.192.10.2"]
options : {}
priority : 100
_uuid : a2e3037c-9fc9-431a-a47e-536ab335da93
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.12.31"
nexthop : []
nexthops : ["10.192.10.2"]
options : {}
priority : 100
-------------------------------------------
LRP on ovnkube-node-bsqb7
-------------------------------------------
_uuid : 472b512d-fb7d-431e-bf0e-e83ebb30fff5
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.14.18"
nexthop : []
nexthops : ["100.88.0.3", "100.88.0.6"]
options : {}
priority : 100
_uuid : 1c7f114f-8475-44c0-a16c-6d930585963d
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.14.28"
nexthop : []
nexthops : ["100.88.0.3", "100.88.0.6"]
options : {}
priority : 100
-------------------------------------------
LRP on ovnkube-node-dqjl4
-------------------------------------------
_uuid : 267611cd-7e0f-4b0c-905d-086f71b9210c
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.12.24"
nexthop : []
nexthops : ["100.88.0.3"]
options : {}
priority : 100
_uuid : 91639887-b506-490e-9950-6fcaa6425a84
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.12.31"
nexthop : []
nexthops : ["100.88.0.3", "100.88.0.6"]
options : {}
priority : 100
-------------------------------------------
LRP on ovnkube-node-j8cd9
-------------------------------------------
_uuid : 14061ec0-cdcc-43b4-8c59-c0ccd1555510
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.14.18"
nexthop : []
nexthops : ["10.192.6.2"]
options : {}
priority : 100
_uuid : 80c1b9f9-46b0-4a4a-b2ee-e7e745640b28
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.12.31"
nexthop : []
nexthops : ["10.192.6.2"]
options : {}
priority : 100
_uuid : 0e90883d-d75d-4a47-b4c1-03b2adbba2e1
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.14.28"
nexthop : []
nexthops : ["10.192.6.2"]
options : {}
priority : 100
andregc@andregc-workpc:~$ ocpods
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
agnhost-https-server-5f85469745-46r9t 1/1 Running 0 163m 10.192.14.28 worker-1.prod-openshift4.redhatrules.local <none> <none>
agnhost-https-server-5f85469745-5qzxj 1/1 Running 0 163m 10.192.14.18 worker-1.prod-openshift4.redhatrules.local <none> <none>
agnhost-https-server-5f85469745-b4dhf 1/1 Running 0 169m 10.192.12.24 worker-2.prod-openshift4.redhatrules.local <none> <none>
support-tools-pod-555f8d887c-wjqtb 1/1 Running 0 163m 10.192.12.31 worker-2.prod-openshift4.redhatrules.local <none> <none>
$ oc describe eip/egress-agnhost-websrv
Name: egress-agnhost-websrv
Namespace:
Labels: <none>
Annotations: <none>
API Version: k8s.ovn.org/v1
Kind: EgressIP
Metadata:
Creation Timestamp: 2024-10-30T12:05:35Z
Generation: 44
Resource Version: 5425065
UID: 3e90f947-ae37-41e4-bb3b-9fbdcf3b3851
Spec:
Egress I Ps:
172.23.183.20
172.23.183.21
Namespace Selector:
Match Labels:
kubernetes.io/metadata.name: agnhost-websrv-testing
Pod Selector:
Status:
Items:
Egress IP: 172.23.183.20
Node: infra-0.prod-openshift4.redhatrules.local
Egress IP: 172.23.183.21
Node: infra-1.prod-openshift4.redhatrules.local
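The listings above are easier to compare mechanically than by eye. The following is a minimal Python sketch, not part of the original report, that parses output in the layout shown above and lists the pod IPs that have no LRP on a given node. The function names parse_lrps and missing_sources are hypothetical, the sample is the ovnkube-node-j8cd9 section copied from above, and treating that node as one that should carry a policy for every selected pod is an assumption drawn from the comparison with ovnkube-node-8wlfp.

```python
import re
from collections import defaultdict

# Section for one node, copied from the listing in this ticket.
SAMPLE = """\
-------------------------------------------
LRP on ovnkube-node-j8cd9
-------------------------------------------
_uuid : 14061ec0-cdcc-43b4-8c59-c0ccd1555510
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.14.18"
nexthop : []
nexthops : ["10.192.6.2"]
options : {}
priority : 100
_uuid : 80c1b9f9-46b0-4a4a-b2ee-e7e745640b28
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.12.31"
nexthop : []
nexthops : ["10.192.6.2"]
options : {}
priority : 100
_uuid : 0e90883d-d75d-4a47-b4c1-03b2adbba2e1
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.14.28"
nexthop : []
nexthops : ["10.192.6.2"]
options : {}
priority : 100
"""

# Pod IPs taken from the pod listing in this ticket.
POD_IPS = ["10.192.14.28", "10.192.14.18", "10.192.12.24", "10.192.12.31"]


def parse_lrps(text):
    """Return {node: {source pod IP: [nexthops]}} from the loop's output.

    Assumes the layout shown in this ticket: an "LRP on <pod>" header per
    node, then "key : value" records that each begin with a "_uuid" line.
    """
    per_node = defaultdict(dict)
    node, src = None, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("LRP on "):
            node, src = line[len("LRP on "):].strip(), None
        elif line.startswith("_uuid"):
            src = None  # a new record starts; forget the previous match
        elif line.startswith("match"):
            m = re.search(r"ip4\.src == ([0-9.]+)", line)
            src = m.group(1) if m else None
        elif line.startswith("nexthops") and node and src:
            # "nexthop : []" does not start with "nexthops", so it is skipped.
            per_node[node][src] = re.findall(r'"([^"]+)"', line)
    return per_node


def missing_sources(per_node, node, expected_ips):
    """List the expected pod IPs that have no LRP on the given node."""
    have = per_node.get(node, {})
    return [ip for ip in expected_ips if ip not in have]


if __name__ == "__main__":
    lrps = parse_lrps(SAMPLE)
    print(missing_sources(lrps, "ovnkube-node-j8cd9", POD_IPS))
```

Run against the sample, this reports 10.192.12.24 as the pod IP with no policy on ovnkube-node-j8cd9, which matches what the listing shows when read by hand.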
This issue happens mainly in the DBs of the egress nodes, but on one occasion I also saw an LRP on a node where the pods run in which, instead of holding only the TS (transit switch) port addresses of the egress nodes, one nexthop was an ovn-k8s-mp0 address, something like:
_uuid : 91639887-b506-490e-9950-6fcaa6425a84
action : reroute
external_ids : {name=egress-agnhost-websrv}
match : "ip4.src == 10.192.12.31"
nexthop : []
nexthops : ["100.88.0.3", "10.192.6.2"]
options : {}
priority : 100
Either issue is fixed only by restarting the respective ovnkube-node pod.
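The mixed-nexthop case can be checked for automatically. The sketch below is only an illustration and was not used in the original investigation: it assumes the transit switch subnet is 100.88.0.0/16, which is consistent with the 100.88.0.x nexthops in the listings, and flags any IPv4 nexthop outside that range. The name stray_nexthops is hypothetical.

```python
import ipaddress

# Assumption: the transit switch (TS) subnet of this cluster is 100.88.0.0/16.
# Adjust this value if the cluster is configured with a different subnet.
TRANSIT_SUBNET = ipaddress.ip_network("100.88.0.0/16")


def stray_nexthops(nexthops, transit_subnet=TRANSIT_SUBNET):
    """Return the nexthops of the subnet's IP family that lie outside it.

    Addresses of the other IP family (for example IPv6 nexthops in a
    dual-stack cluster) are ignored rather than reported.
    """
    stray = []
    for hop in nexthops:
        addr = ipaddress.ip_address(hop)
        if addr.version != transit_subnet.version:
            continue
        if addr not in transit_subnet:
            stray.append(hop)
    return stray


if __name__ == "__main__":
    # nexthops of the policy for 10.192.12.31 quoted in this ticket
    print(stray_nexthops(["100.88.0.3", "10.192.6.2"]))
```

For the policy quoted above this returns 10.192.6.2, the ovn-k8s-mp0 address, while a policy with nexthops 100.88.0.3 and 100.88.0.6 returns an empty list.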
To summarize, the issue is not easy to reproduce, and so far I have seen it only with a dual-stack setup. With only EIPs on additional networks, I was not able to reproduce the same issues with missing LRPs and node settings. The customer, with EIPs mainly on IPv4, seems to have stabilized as well.
Another thing I noticed is that OVN somehow fails to keep its DBs up to date when a re-assignment happens or the egressIP health checks are disrupted. For example, the customer was having issues with the ovn-control-planes removing node(s) from the EIP assignment because of failed health checks, and setting reachabilityTotalTimeoutSeconds helped with that. I see the ovnkube-controllers doing their normal job of avoiding stale entries, but perhaps they are either failing to ensure the LRPs are created accordingly, or deleting LRPs and then missing some.
- clones
OCPBUGS-38705 [4.16] EgressIP intermittent connection timeout while communicating with external services
- Closed
- depends on
OCPBUGS-41340 [4.15] EgressIP intermittent connection timeout while communicating with external services
- Closed
- impacts account
OCPBUGS-44493 [OVN] IPv4 and IPv6 IPs in a single EIP should be able to be assigned to a single egress node
- New
OCPBUGS-41340 [4.15] EgressIP intermittent connection timeout while communicating with external services
- Closed
- links to
RHBA-2024:6818 OpenShift Container Platform 4.15.z bug fix update
RHBA-2024:10523 OpenShift Container Platform 4.14.z bug fix update