[4.16] EgressIP intermittent connection timeout while communicating with external services

      * Previously, for egress IP, if an IP was assigned to an egress node and it was deleted, then pods selected by that `egressIP` might have had incorrect routing information to that egress node. With this release, the issue is fixed. (link:https://issues.redhat.com/browse/OCPBUGS-38705[*OCPBUGS-38705*])
      For EgressIP, if an IP is assigned to an "egress node" and it is deleted, then pods selected by that EgressIP may have incorrect routing information to that egress node.
      Description of problem:

      - Pods in a namespace that uses EgressIP experience intermittent TCP I/O timeouts when attempting to communicate with external services.

      • Connection results when connecting to the external service from one of the pods:
        ❯ oc exec gitlab-runner-aj-02-56998875b-n6xxb -- bash -c 'while true; do timeout 3 bash -c "</dev/tcp/" && echo "Connection success" || echo "Connection timeout"; sleep 0.5; done'
        Connection success
        Connection timeout
        Connection timeout
        Connection timeout
        Connection timeout
        Connection timeout
        Connection success
        Connection timeout
        Connection success 
      • The customer followed the solution at https://access.redhat.com/solutions/7005481 and found an IP address in the logical_router_policy nexthops that is not associated with any node.
        # Get the pod name, node, and pod IP for the problematic pod
        ❯ oc get pod gitlab-runner-aj-02-56998875b-n6xxb -ojson 2>/dev/null | jq -r '"\(.metadata.name) \(.spec.nodeName) \(.status.podIP)"' | read -r pod node podip
        # Find the ovn-kubernetes pod running on the same node as gitlab-runner-aj-02-56998875b-n6xxb
        ❯ oc get pods -n openshift-ovn-kubernetes -lapp=ovnkube-node -ojson | jq --arg node "$node" -r '.items[] | select(.spec.nodeName == $node)| .metadata.name' | read -r ovn_pod
        # Collect each possible logical switch port address into variable LSP_ADDRESSES
        ❯ LSP_ADDRESSES=$(oc -n openshift-ovn-kubernetes exec ${ovn_pod} -it -c northd -- bash -c 'ovn-nbctl lsp-list transit_switch | while read guid name; do printf "%s " "${name}"; ovn-nbctl lsp-get-addresses "${guid}"; done')
        # List the logical router policy for the problematic pod
        ❯ oc -n openshift-ovn-kubernetes exec ${ovn_pod} -c northd -- ovn-nbctl find logical_router_policy match="\"ip4.src == ${podip}\""
        _uuid               : c55bec59-6f9a-4f01-a0b1-67157039edb8
        action              : reroute
        external_ids        : {name=gitlab-runner-caasandpaas-egress}
        match               : "ip4.src =="
        nexthop             : []
        nexthops            : ["", ""]
        options             : {}
        priority            : 100
        # Check whether each nexthop entry exists in the LSP addresses table
        ❯ echo $LSP_ADDRESSES | grep
        (tstor-c1nmedi01-9x2g9-worker-cloud-paks-m9t6b) 0a:58:64:58:00:16
        ❯ echo $LSP_ADDRESSES | grep 
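      The per-nexthop grep checks above can be wrapped in a small loop. The following is a minimal sketch, not part of the referenced solution: the function name `find_stale_nexthops` is hypothetical, the sample addresses are documentation-range placeholders rather than values from the affected cluster, and the assumed layout of the address dump (node name, MAC, IP with prefix length) is an illustration only.

```shell
# Hypothetical helper (not part of the referenced solution): print every nexthop
# that does not appear in the LSP address dump.
#   $1   = LSP address dump, e.g. the contents of LSP_ADDRESSES
#   $2.. = nexthop IP addresses taken from the logical_router_policy
# The body runs in a subshell, so its variables do not leak into the caller.
find_stale_nexthops() (
  lsp_addresses=$1
  shift
  for hop in "$@"; do
    # -F: treat the IP as a fixed string; -w: match whole words only,
    # so 192.0.2.2 does not match inside 192.0.2.22
    if ! printf '%s\n' "$lsp_addresses" | grep -qwF -- "$hop"; then
      echo "stale nexthop: $hop"
    fi
  done
)

# Placeholder data using documentation-range addresses (assumed dump layout)
sample_lsp='(worker-a) 0a:58:64:58:00:16 192.0.2.22/24 (worker-b) 0a:58:64:58:00:17 192.0.2.23/24'
find_stale_nexthops "$sample_lsp" 192.0.2.22 192.0.2.99
```

      In the session above, the function would be called with "$LSP_ADDRESSES" as the first argument, followed by the entries of the nexthops column. Any address it prints is a nexthop that no transit switch port advertises, which matches the symptom described in this report.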


      Version-Release number of selected component (if applicable):

      How reproducible:

      Steps to Reproduce:




      Actual results:

      • Pods configured to use EgressIP experience intermittent connection timeouts when connecting to external services.

      Expected results:

      • Connections from these pods to external services should not time out.

      Additional info:

