OCPBUGS-38653: [4.17] EgressIP intermittent connection timeout while communicating with external services

    • Important
    • Release Note Not Required
    • Customer Escalated

      Description of problem:

      • Pods residing in a namespace that uses EgressIP experience intermittent TCP I/O timeouts when communicating with external services. (A minimal example of such an EgressIP configuration is sketched at the end of this description.)

      • Connection responses while connecting to the external service from one of the pods:
        ❯ oc exec gitlab-runner-aj-02-56998875b-n6xxb -- bash -c 'while true; do timeout 3 bash -c "</dev/tcp/10.135.108.56/443" && echo "Connection success" || echo "Connection timeout"; sleep 0.5; done'
        Connection success
        Connection timeout
        Connection timeout
        Connection timeout
        Connection timeout
        Connection timeout
        Connection success
        Connection timeout
        Connection success 
      • The customer followed the steps in https://access.redhat.com/solutions/7005481 and noticed an IP address in the logical_router_policy nexthops that is not associated with any node.
        # Capture the problematic pod's name, node and podIP into variables
        ❯ oc get pod gitlab-runner-aj-02-56998875b-n6xxb -ojson 2>/dev/null | jq -r '"\(.metadata.name) \(.spec.nodeName) \(.status.podIP)"' | read -r pod node podip
        
        # Find the ovn-kubernetes pod running on the same node as gitlab-runner-aj-02-56998875b-n6xxb
        ❯ oc get pods -n openshift-ovn-kubernetes -lapp=ovnkube-node -ojson | jq --arg node "$node" -r '.items[] | select(.spec.nodeName == $node)| .metadata.name' | read -r ovn_pod
        
        # Collect each possible logical switch port address into variable LSP_ADDRESSES
        ❯ LSP_ADDRESSES=$(oc -n openshift-ovn-kubernetes exec ${ovn_pod} -it -c northd -- bash -c 'ovn-nbctl lsp-list transit_switch | while read guid name; do printf "%s " "${name}"; ovn-nbctl lsp-get-addresses "${guid}"; done')
        
        # List the logical router policy for the problematic pod
        ❯ oc -n openshift-ovn-kubernetes exec ${ovn_pod} -c northd -- ovn-nbctl find logical_router_policy match="\"ip4.src == ${podip}\""
        _uuid               : c55bec59-6f9a-4f01-a0b1-67157039edb8
        action              : reroute
        external_ids        : {name=gitlab-runner-caasandpaas-egress}
        match               : "ip4.src == 172.40.114.40"
        nexthop             : []
        nexthops            : ["100.88.0.22", "100.88.0.57"]
        options             : {}
        priority            : 100
        
        # Check whether each nexthop entry exists in the LSP addresses table
        ❯ echo $LSP_ADDRESSES | grep 100.88.0.22
        (tstor-c1nmedi01-9x2g9-worker-cloud-paks-m9t6b) 0a:58:64:58:00:16 100.88.0.22/16
        ❯ echo $LSP_ADDRESSES | grep 100.88.0.57 

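      • Note that the second grep above returns no output: 100.88.0.57 has no matching logical switch port on transit_switch, i.e. the reroute policy still carries a nexthop for a node that no longer exists (or no longer hosts the egress IP). The per-command checks above can be rolled into a single loop; the sketch below is only illustrative, uses the same ovn-nbctl commands, treats the pod name as a placeholder, and uses command substitutions instead of piping into read (so it also works in shells where the last pipeline stage runs in a subshell):
        # Placeholder: the affected pod (adjust name/namespace as needed)
        ❯ pod=gitlab-runner-aj-02-56998875b-n6xxb
        ❯ node=$(oc get pod "$pod" -o jsonpath='{.spec.nodeName}')
        ❯ podip=$(oc get pod "$pod" -o jsonpath='{.status.podIP}')
        ❯ ovn_pod=$(oc get pods -n openshift-ovn-kubernetes -l app=ovnkube-node -ojson | jq --arg node "$node" -r '.items[] | select(.spec.nodeName == $node) | .metadata.name')

        # Collect every transit switch port address once (same command as above)
        ❯ lsp_addresses=$(oc -n openshift-ovn-kubernetes exec "$ovn_pod" -c northd -- bash -c 'ovn-nbctl lsp-list transit_switch | while read -r guid name; do printf "%s " "$name"; ovn-nbctl lsp-get-addresses "$guid"; done')

        # Check every nexthop of the pod's reroute policy against those addresses
        ❯ nexthops=$(oc -n openshift-ovn-kubernetes exec "$ovn_pod" -c northd -- ovn-nbctl --bare --columns=nexthops find logical_router_policy match="\"ip4.src == ${podip}\"")
        ❯ for nh in $nexthops; do
            if grep -qF "$nh" <<< "$lsp_addresses"; then
              echo "$nh: matching transit switch port found"
            else
              echo "$nh: STALE - no transit switch port with this address"
            fi
          done
      • Cross-checking a stale nexthop against the current node list (oc get nodes) should confirm whether it belongs to a node that has since been removed from the cluster.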
         
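      • For reference, the EgressIP configuration referred to above is typically an EgressIP object whose namespaceSelector matches the affected namespace, with the egress IPs hosted on nodes carrying the egress-assignable label. The sketch below only illustrates that shape; the object name, selector label, file name and IP address are placeholders, not values from this cluster:
        # egressip.yaml - minimal example (name, selector and address are placeholders)
        apiVersion: k8s.ovn.org/v1
        kind: EgressIP
        metadata:
          name: example-egress
        spec:
          egressIPs:
            - 192.0.2.10
          namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: example-namespace

        # Nodes must carry the egress-assignable label to host the egress IP
        ❯ oc label node <node-name> k8s.ovn.org/egress-assignable=""
        ❯ oc apply -f egressip.yaml
        # The EgressIP status reports which node each egress IP is currently assigned to
        ❯ oc get egressip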

      Version-Release number of selected component (if applicable):

      How reproducible:

      Steps to Reproduce:

      1.

      2.

      3.

      Actual results:

      • Pods configured to use EgressIP face intermittent connection timeouts when connecting to external services.

      Expected results:

      • The intermittent connection timeouts should not occur.

      Additional info:

      Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

      Affected Platforms:

      Is it an

      1. internal CI failure
      2. customer issue / SD
      3. internal Red Hat testing failure

      If it is an internal Red Hat testing failure:

      • Please share a kubeconfig or credentials to a live cluster for the assignee to debug/troubleshoot, along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

      If it is a CI failure:

      • Did it happen in different CI lanes? If so, please provide links to multiple failures with the same error instance
      • Did it happen in both sdn and ovn jobs? If so, please provide links to multiple failures with the same error instance
      • Did it happen on other platforms (e.g. aws, azure, gcp, baremetal, etc.)? If so, please provide links to multiple failures with the same error instance
      • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
      • If it's a connectivity issue,
      • What is the srcNode, srcIP and srcNamespace and srcPodName?
      • What is the dstNode, dstIP and dstNamespace and dstPodName?
      • What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2Node, etc.)

      If it is a customer / SD issue:

      • Provide enough information in the bug description that Engineering doesn’t need to read the entire case history.
      • Don’t presume that Engineering has access to Salesforce.
      • Do presume that Engineering will access attachments through supportshell.
      • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
      • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
        • If the issue is in a customer namespace, provide a namespace inspect (an example command is sketched after this list).
        • If it is a connectivity issue:
          • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
          • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
          • What is the traffic path? (examples: pod2pod, pod2external, pod2svc, pod2Node, etc.)
          • Please provide the UTC timestamp of the networking outage window from the must-gather
          • Please provide tcpdump pcaps taken during the outage, filtered on the src/dst IPs provided above (an example capture command is sketched after this list)
        • If it is not a connectivity issue:
          • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when the problem happened, if any.
      • When showing the results from commands, include the entire command in the output.  
      • For OCPBUGS in which the issue has been identified, label with “sbr-triaged”
      • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with “sbr-untriaged”
      • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
      • Note: bugs that do not meet these minimum standards will be closed with label “SDN-Jira-template”
      • For guidance on using this template, please see OCPBUGS Template Training for Networking components
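      • As an illustration of the namespace inspect and filtered packet capture requested above, the commands below are only a sketch; the namespace, node and IP values are placeholders and may need adjusting for the environment:
        # Namespace inspect for the affected customer namespace
        ❯ oc adm inspect ns/<customer-namespace> --dest-dir=inspect.local

        # Packet capture on the relevant node, filtered on the provided src/dst IPs
        ❯ oc debug node/<node-name> -- chroot /host timeout 120 tcpdump -i any -nn -w /var/tmp/outage.pcap host <srcPodIP> and host <dstIP>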
