  1. OpenShift Bugs
  2. OCPBUGS-44793

[4.14] EgressIP intermittent connection timeout while communicating with external services


    • Important
    • None
    • SDN Sprint 262, SDN Sprint 263
    • 2
    • False
    • N/A
    • Release Note Not Required
    • Done

      Description of problem:

      I opened this ticket to request a backport to 4.14, because we have a critical account that is having the same issue and is not able to upgrade past 4.14 in the near future.

       

      This ticket is also based on https://issues.redhat.com/browse/OCPBUGS-44493, which holds the main information about the missing LRPs observed in the OVN DBs of the nodes assigned egress IPs. This new ticket was opened to track the issues separately.

       

      Description taken from the first bug opened:

       

      - Another problem, and the most serious one because it has caused many disruptions to the customer's connections, is that this implementation occasionally leaves OVN with inconsistent LRPs in the OVN DBs of the nodes assigned egress IPs. One example is shown below:

      $ for OVNNODE in ovnkube-node-8wlfp ovnkube-node-bsqb7 ovnkube-node-dqjl4 ovnkube-node-j8cd9; \
      do echo "-----------------------------------------" ; \
      echo "LRP on $OVNNODE"  ; \
      echo "------------------------------------------" ; \
      oc -n openshift-ovn-kubernetes exec -c northd $OVNNODE -- ovn-nbctl find logical_router_policy external_ids='{"name"="egress-agnhost-websrv"}' ; \
      sleep 1; done
      -------------------------------------------
      LRP on ovnkube-node-8wlfp
      -------------------------------------------
      _uuid               : 1506ff82-77ea-420a-bd8d-28bde56cfd33
      action              : reroute
      external_ids        : {name=egress-agnhost-websrv}
      match               : "ip4.src == 10.192.14.28"
      nexthop             : []
      nexthops            : ["10.192.10.2"]
      options             : {}
      priority            : 100

      _uuid               : 67dce56c-c5ee-47a3-937c-5092346dab31
      action              : reroute
      external_ids        : {name=egress-agnhost-websrv}
      match               : "ip4.src == 10.192.14.18"
      nexthop             : []
      nexthops            : ["10.192.10.2"]
      options             : {}
      priority            : 100

      _uuid               : 040cf3b9-bdf3-456e-9dee-24b0cdc62af4
      action              : reroute
      external_ids        : {name=egress-agnhost-websrv}
      match               : "ip4.src == 10.192.12.24"
      nexthop             : []
      nexthops            : ["10.192.10.2"]
      options             : {}
      priority            : 100

      _uuid               : a2e3037c-9fc9-431a-a47e-536ab335da93
      action              : reroute
      external_ids        : {name=egress-agnhost-websrv}
      match               : "ip4.src == 10.192.12.31"
      nexthop             : []
      nexthops            : ["10.192.10.2"]
      options             : {}
      priority            : 100
      -------------------------------------------
      LRP on ovnkube-node-bsqb7
      -------------------------------------------
      _uuid               : 472b512d-fb7d-431e-bf0e-e83ebb30fff5
      action              : reroute
      external_ids        : {name=egress-agnhost-websrv}
      match               : "ip4.src == 10.192.14.18"
      nexthop             : []
      nexthops            : ["100.88.0.3", "100.88.0.6"]
      options             : {}
      priority            : 100

      _uuid               : 1c7f114f-8475-44c0-a16c-6d930585963d
      action              : reroute
      external_ids        : {name=egress-agnhost-websrv}
      match               : "ip4.src == 10.192.14.28"
      nexthop             : []
      nexthops            : ["100.88.0.3", "100.88.0.6"]
      options             : {}
      priority            : 100
      -------------------------------------------
      LRP on ovnkube-node-dqjl4
      -------------------------------------------
      _uuid               : 267611cd-7e0f-4b0c-905d-086f71b9210c
      action              : reroute
      external_ids        : {name=egress-agnhost-websrv}
      match               : "ip4.src == 10.192.12.24"
      nexthop             : []
      nexthops            : ["100.88.0.3"]
      options             : {}
      priority            : 100

      _uuid               : 91639887-b506-490e-9950-6fcaa6425a84
      action              : reroute
      external_ids        : {name=egress-agnhost-websrv}
      match               : "ip4.src == 10.192.12.31"
      nexthop             : []
      nexthops            : ["100.88.0.3", "100.88.0.6"]
      options             : {}
      priority            : 100
      -------------------------------------------
      LRP on ovnkube-node-j8cd9
      -------------------------------------------
      _uuid               : 14061ec0-cdcc-43b4-8c59-c0ccd1555510
      action              : reroute
      external_ids        : {name=egress-agnhost-websrv}
      match               : "ip4.src == 10.192.14.18"
      nexthop             : []
      nexthops            : ["10.192.6.2"]
      options             : {}
      priority            : 100

      _uuid               : 80c1b9f9-46b0-4a4a-b2ee-e7e745640b28
      action              : reroute
      external_ids        : {name=egress-agnhost-websrv}
      match               : "ip4.src == 10.192.12.31"
      nexthop             : []
      nexthops            : ["10.192.6.2"]
      options             : {}
      priority            : 100

      _uuid               : 0e90883d-d75d-4a47-b4c1-03b2adbba2e1
      action              : reroute
      external_ids        : {name=egress-agnhost-websrv}
      match               : "ip4.src == 10.192.14.28"
      nexthop             : []
      nexthops            : ["10.192.6.2"]
      options             : {}
      priority            : 100

      $ ocpods
      NAME                                    READY   STATUS    RESTARTS   AGE    IP             NODE                                         NOMINATED NODE   READINESS GATES
      agnhost-https-server-5f85469745-46r9t   1/1     Running   0          163m   10.192.14.28   worker-1.prod-openshift4.redhatrules.local   <none>           <none>
      agnhost-https-server-5f85469745-5qzxj   1/1     Running   0          163m   10.192.14.18   worker-1.prod-openshift4.redhatrules.local   <none>           <none>
      agnhost-https-server-5f85469745-b4dhf   1/1     Running   0          169m   10.192.12.24   worker-2.prod-openshift4.redhatrules.local   <none>           <none>
      support-tools-pod-555f8d887c-wjqtb      1/1     Running   0          163m   10.192.12.31   worker-2.prod-openshift4.redhatrules.local   <none>           <none>

      $ oc describe eip/egress-agnhost-websrv
      Name:         egress-agnhost-websrv
      Namespace:    
      Labels:       <none>
      Annotations:  <none>
      API Version:  k8s.ovn.org/v1
      Kind:         EgressIP
      Metadata:
        Creation Timestamp:  2024-10-30T12:05:35Z
        Generation:          44
        Resource Version:    5425065
        UID:                 3e90f947-ae37-41e4-bb3b-9fbdcf3b3851
      Spec:
        Egress I Ps:
          172.23.183.20
          172.23.183.21
        Namespace Selector:
          Match Labels:
            kubernetes.io/metadata.name:  agnhost-websrv-testing
        Pod Selector:
      Status:
        Items:
          Egress IP:  172.23.183.20
          Node:       infra-0.prod-openshift4.redhatrules.local
          Egress IP:  172.23.183.21
          Node:       infra-1.prod-openshift4.redhatrules.local
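
The per-node comparison above can be done mechanically. What follows is a minimal Python sketch, not part of the original report, that parses the text printed by `ovn-nbctl find logical_router_policy` and lists the pod IPs that have no reroute policy on a given node. The helper names `parse_lrps` and `missing_pod_ips` are hypothetical, the sample text is abridged from the ovnkube-node-j8cd9 output, and the sketch assumes IPv4-only `ip4.src == <address>` matches like the ones shown here. Which pod IPs should be present on a given node is left to the caller.

```python
import re


def parse_lrps(text):
    """Parse `ovn-nbctl find logical_router_policy` text into {pod_ip: [nexthops]}."""
    policies = {}
    pod_ip = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank lines and separators carry no column data
        key, value = key.strip(), value.strip()
        if key == "match":
            m = re.search(r"ip4\.src == ([0-9.]+)", value)
            pod_ip = m.group(1) if m else None
        elif key == "nexthops" and pod_ip is not None:
            # Collect every quoted address in the nexthops list.
            policies[pod_ip] = re.findall(r'"([^"]+)"', value)
            pod_ip = None
    return policies


def missing_pod_ips(expected_ips, policies):
    """Return the expected pod IPs that have no reroute policy, sorted."""
    return sorted(set(expected_ips) - set(policies))


# Abridged from the ovnkube-node-j8cd9 output in this report.
SAMPLE = """\
match               : "ip4.src == 10.192.14.18"
nexthop             : []
nexthops            : ["10.192.6.2"]

match               : "ip4.src == 10.192.12.31"
nexthop             : []
nexthops            : ["10.192.6.2"]

match               : "ip4.src == 10.192.14.28"
nexthop             : []
nexthops            : ["10.192.6.2"]
"""

# Pod IPs taken from the pod listing in this report.
POD_IPS = ["10.192.14.28", "10.192.14.18", "10.192.12.24", "10.192.12.31"]

policies = parse_lrps(SAMPLE)
print(missing_pod_ips(POD_IPS, policies))  # ['10.192.12.24']
```

Run against the ovnkube-node-8wlfp output instead, the same check would return an empty list, since that node has a policy for all four pod IPs.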

       

      This issue occurs mainly in the DBs of the egress nodes. On one occasion, however, an LRP on a node where pods run had the ovn-k8s-mp0 address as one of its nexthops instead of the TS (transit switch) port addresses of the egress nodes, something like:

       

      _uuid               : 91639887-b506-490e-9950-6fcaa6425a84
      action              : reroute
      external_ids        : {name=egress-agnhost-websrv}
      match               : "ip4.src == 10.192.12.31"
      nexthop             : []
      nexthops            : ["100.88.0.3", "10.192.6.2"]
      options             : {}
      priority            : 100
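
A stray nexthop like the one above can also be flagged automatically on the nodes where pods run. The following Python sketch is illustrative only and not part of the original report. It assumes the TS port addresses come from 100.88.0.0/16, which is consistent with the 100.88.0.x addresses shown in this ticket but is configurable, and it assumes one nexthop is expected per assigned egress IP (two for this EgressIP). `check_nexthops` is a hypothetical helper name.

```python
import ipaddress

# Assumption: transit switch (TS) port addresses fall inside this subnet.
TRANSIT_SUBNET = ipaddress.ip_network("100.88.0.0/16")


def check_nexthops(nexthops, expected_count, subnet=TRANSIT_SUBNET):
    """Return a list of human-readable problems found in one policy's nexthops."""
    problems = []
    if len(nexthops) != expected_count:
        problems.append(
            f"expected {expected_count} nexthops, found {len(nexthops)}"
        )
    for hop in nexthops:
        # Any address outside the TS subnet is suspicious on a pod node.
        if ipaddress.ip_address(hop) not in subnet:
            problems.append(f"nexthop {hop} is outside {subnet}")
    return problems


# The policy shown above: one TS address and one address outside the subnet.
print(check_nexthops(["100.88.0.3", "10.192.6.2"], expected_count=2))
```

The same helper reports the single-nexthop policy seen on ovnkube-node-dqjl4 (`["100.88.0.3"]`) as having too few nexthops. It is not meant for the egress nodes themselves, whose policies in this report point at a local address rather than a TS address.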

       

      Either issue is fixed only by restarting the respective ovnkube-node pod.

      To summarize, the issue is not easy to reproduce, and so far I have only seen it with a dual-stack setup. With only EIPs on additional networks, I was not able to see the same issues with missing LRPs and node settings. The customer, with EIPs mainly on IPv4, also seems to have stabilized.

      Another thing I observed is that OVN seems unable to keep its DBs updated when a re-assignment happens or when the egressIP health checks are disrupted. For example, the customer had issues with the ovn-control-planes marking node(s) out of the EIP assignment because of failed health checks, which setting reachabilityTotalTimeoutSeconds helped with. I see the ovnkube-controllers doing their normal job of avoiding stale entries, but they may either fail to ensure that the LRPs are created accordingly or delete LRPs and then miss some.

              pepalani@redhat.com Periyasamy Palanisamy
              rhn-support-andcosta Andre Costa
              Huiran Wang Huiran Wang