OCPBUGS-48335

[EIP UDN layer3 pre-merge testing] In LGW mode, after egressIP is deleted, egress packets from local or remote EIP pods cannot be captured on the pod's host


      Description of problem: [EIP UDN layer3 pre-merge testing] In LGW mode, after the egressIP is deleted, egress packets from local or remote EIP pods cannot be captured on the pod's host.

      Version-Release number of selected component (if applicable):

      How reproducible:

      Steps to Reproduce:

      1. Labeled a node as the egress node (a verification sketch follows the command output below)

       

      $ oc label node jechen-udn-eip-6sv2r-worker-a-jbvmn k8s.ovn.org/egress-assignable=true
      node/jechen-udn-eip-6sv2r-worker-a-jbvmn labeled
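      The label assignment can be double-checked with a standard selector query (a minimal verification sketch, not part of the original reproduction):

      $ oc get nodes -l k8s.ovn.org/egress-assignable=true -o name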

       

      2. Created a namespace, labeled it to match the namespaceSelector of the egressIP object created in step 3, and created a layer3 UDN in the namespace (an apply/status-check sketch follows the UDN listing below)

       

      $ oc get ns test --show-labels 
      NAME   STATUS   AGE   LABELS
      test   Active   36m   kubernetes.io/metadata.name=test,pod-security.kubernetes.io/audit-version=latest,pod-security.kubernetes.io/audit=restricted,pod-security.kubernetes.io/enforce-version=latest,pod-security.kubernetes.io/enforce=restricted,pod-security.kubernetes.io/warn-version=latest,pod-security.kubernetes.io/warn=restricted,team=red

      $ cat udn-layer3-ns-test.yaml

      kind: List
      apiVersion: v1
      metadata: {}
      items:
      - apiVersion: k8s.ovn.org/v1
        kind: UserDefinedNetwork
        metadata:
          name: l3-network-test
          namespace: test
        spec:
          layer3:
            mtu: 1300
            role: Primary
            subnets:
            - cidr: 10.150.0.0/16
              hostSubnet: 24
          topology: Layer3

       

      $ oc get userdefinednetwork 
      NAME              AGE
      l3-network-test   35m
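      If helpful, the UDN manifest above can be applied and its status inspected before proceeding (a sketch using standard oc commands, not part of the original reproduction):

      $ oc apply -f udn-layer3-ns-test.yaml
      $ oc get userdefinednetwork l3-network-test -n test -o yaml    # inspect .status.conditions before proceeding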

       

      3. Picked an unused IP from the same subnet as the egress node, created an egressIP object, and waited until the egressIP was assigned to the egress node (an assignment cross-check sketch follows the output below)

       

      $ cat config_egressip1_ovn_ns_team_red_gcp.yaml
      apiVersion: k8s.ovn.org/v1
      kind: EgressIP
      metadata:
        name: egressip-red
      spec:
        egressIPs:
        - 10.0.128.101
        namespaceSelector:
          matchLabels:
            team: red 

       

      $ oc get egressips.k8s.ovn.org 
      NAME           EGRESSIPS      ASSIGNED NODE                         ASSIGNED EGRESSIPS
      egressip-red   10.0.128.101   jechen-udn-eip-6sv2r-worker-a-jbvmn   10.0.128.101
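      On a cloud platform the assignment can also be cross-checked against the CloudPrivateIPConfig object and the egress node's br-ex interface (a sketch; the object being named after the IP and br-ex carrying the secondary address are assumptions based on typical OVN-Kubernetes egressIP behavior):

      $ oc get cloudprivateipconfig 10.0.128.101
      $ oc debug node/jechen-udn-eip-6sv2r-worker-a-jbvmn -- chroot /host ip -4 addr show br-ex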

      4. Created multiple test pods, some local and some remote to the egress node, curled an external host from these pods, and verified that the egressIP was used as the source IP of the egress packets before deleting the egressIP object (an additional br-ex capture sketch follows the remote-pod capture below)

       

      $ oc get pod -owide
      NAME            READY   STATUS    RESTARTS   AGE   IP            NODE                                  NOMINATED NODE   READINESS GATES
      test-rc-cj78b   1/1     Running   0          37m   10.131.0.10   jechen-udn-eip-6sv2r-worker-b-hwlnc   <none>           <none>
      test-rc-sm4xw   1/1     Running   0          37m   10.129.2.9    jechen-udn-eip-6sv2r-worker-a-jbvmn   <none>           <none>
      test-rc-zrthk   1/1     Running   0          37m   10.128.2.29   jechen-udn-eip-6sv2r-worker-c-h5x7d   <none>           <none>

       

      from a local EIP pod

      $ oc rsh test-rc-sm4xw
      ~ $ curl -s 'http://34.160.111.145/?request=jhs6hv5x' --connect-timeout 5
      fault filter abort~ $ exit

      $ oc debug node/jechen-udn-eip-6sv2r-worker-a-jbvmn
      Temporary namespace openshift-debug-jcjtb is created for debugging node...
      Starting pod/jechen-udn-eip-6sv2r-worker-a-jbvmn-debug-r2wd2 ...
      To use host binaries, run `chroot /host`
      Pod IP: 10.0.128.2
      If you don't see a command prompt, try pressing enter.
      sh-5.1# chroot /host
      sh-5.1# nmcli con show
      NAME                UUID                                  TYPE           DEVICE 
      ovs-if-br-ex        bd6d7911-19f3-4cba-93fc-6017b4243b32  ovs-interface  br-ex  
      br-ex               35bf9cfc-43d2-4dd2-a5f0-ccc01c97dbfb  ovs-bridge     br-ex  
      ovs-if-phys0        35b2a425-8287-4341-868f-91fbf5dba946  ethernet       ens4   
      ovs-port-br-ex      5de05d52-76ea-4762-8e19-c22467aac26c  ovs-port       br-ex  
      ovs-port-phys0      c972b7ae-4367-4fc6-b05d-740a718a3937  ovs-port       ens4   
      lo                  44a8a80f-8511-4c56-b56b-0b476146bca7  loopback       lo     
      Wired connection 1  7d5310c7-1fec-379f-a00c-68d096bc810c  ethernet       -     
      sh-5.1# exit
      exit
      sh-5.1# timeout 60s tcpdump -c 4 -nni ens4 host 34.160.111.145
      dropped privs to tcpdump
      tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
      listening on ens4, link-type EN10MB (Ethernet), snapshot length 262144 bytes
      15:44:52.530138 IP 10.0.128.101.49976 > 34.160.111.145.80: Flags [S], seq 2954161173, win 32760, options [mss 1260,sackOK,TS val 1389518110 ecr 0,nop,wscale 7], length 0
      15:44:52.532627 IP 34.160.111.145.80 > 10.0.128.101.49976: Flags [S.], seq 311399623, ack 2954161174, win 65535, options [mss 1412,sackOK,TS val 2112755645 ecr 1389518110,nop,wscale 8], length 0
      15:44:52.535995 IP 10.0.128.101.49976 > 34.160.111.145.80: Flags [P.], seq 1:96, ack 1, win 256, options [nop,nop,TS val 1389518120 ecr 2112755645], length 95: HTTP: GET /?request=jhs6hv5x HTTP/1.1
      15:44:52.536008 IP 10.0.128.101.49976 > 34.160.111.145.80: Flags [.], ack 1, win 256, options [nop,nop,TS val 1389518119 ecr 2112755645], length 0
      4 packets captured
      10 packets received by filter
      0 packets dropped by kernel

       

      from a remote EIP pod

      $ oc rsh test-rc-cj78b
      ~ $ curl -s 'http://34.160.111.145/?request=jhs6hv5x' --connect-timeout 5
      fault filter abort~ $ exit

       

      captured on the egress node jechen-udn-eip-6sv2r-worker-a-jbvmn

      sh-5.1# timeout 60s tcpdump -c 4 -nni ens4 host 34.160.111.145
      dropped privs to tcpdump
      tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
      listening on ens4, link-type EN10MB (Ethernet), snapshot length 262144 bytes
      15:45:25.684207 IP 10.0.128.101.43048 > 34.160.111.145.80: Flags [S], seq 1573638273, win 32760, options [mss 1260,sackOK,TS val 3867574280 ecr 0,nop,wscale 7], length 0
      15:45:25.686369 IP 34.160.111.145.80 > 10.0.128.101.43048: Flags [S.], seq 795599411, ack 1573638274, win 65535, options [mss 1412,sackOK,TS val 2136694170 ecr 3867574280,nop,wscale 8], length 0
      15:45:25.688597 IP 10.0.128.101.43048 > 34.160.111.145.80: Flags [.], ack 1, win 256, options [nop,nop,TS val 3867574286 ecr 2136694170], length 0
      15:45:25.688612 IP 10.0.128.101.43048 > 34.160.111.145.80: Flags [P.], seq 1:96, ack 1, win 256, options [nop,nop,TS val 3867574286 ecr 2136694170], length 95: HTTP: GET /?request=jhs6hv5x HTTP/1.1
      4 packets captured
      10 packets received by filter
      0 packets dropped by kernel
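      Since LGW mode routes pod egress traffic through the host network stack, a parallel capture on br-ex (interface name taken from the nmcli output above) can help show whether the packets reach the host at all; this is an additional debugging sketch, not part of the original reproduction:

      sh-5.1# timeout 60s tcpdump -nni br-ex host 34.160.111.145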

       

      5. Deleted the egressIP object and verified via cloudprivateipconfig that the egressIP was indeed removed (a node-side check sketch follows the output below)

       

      $ oc delete egressips.k8s.ovn.org egressip-red
      egressip.k8s.ovn.org "egressip-red" deleted

      $ oc get egressips.k8s.ovn.org 
      No resources found

      $ oc get cloudprivateipconfig
      No resources found
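      As an extra node-side check (a sketch, not part of the original steps), the deleted egressIP should also no longer appear on the egress node's br-ex interface:

      $ oc debug node/jechen-udn-eip-6sv2r-worker-a-jbvmn -- chroot /host ip -4 addr show br-ex
      (10.0.128.101 should no longer be listed)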

      6. Repeated step 4: curled the external host from the test pods

      Actual results: egress packets from local or remote EIP pods could not be captured on the pod's host.

      Expected results: egress packets should be captured on the pod's host, with the pod's host IP used as the source IP of these egress packets (a source-filtered capture sketch is shown below).
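      For reference, the expected behavior would be confirmed with a source-filtered capture on each pod's host, along the lines of the sketch below (10.0.128.2 is the host IP of worker-a as reported by the debug pod in step 4; the other workers' IPs would be substituted for their own pods):

      sh-5.1# timeout 60s tcpdump -c 4 -nni ens4 host 34.160.111.145 and src host 10.0.128.2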

      Additional info:

      By comparison, the same test passed in SGW mode for a layer3 UDN: after the egressIP was deleted, egress packets used the pod's host IP as the source IP.

      ovnkube-node pod log: https://drive.google.com/file/d/1MQgDRIYnNM5RzkARThp-Flo2zavtYL6K/view?usp=drive_link

      must-gather: https://drive.google.com/file/d/1axEnHoEgzaTGQ27S2odicVUF1gYfXl-a/view?usp=drive_link
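      For context, the cluster's gateway mode can be confirmed from the cluster network operator configuration; routingViaHost=true corresponds to LGW mode (a sketch):

      $ oc get network.operator.openshift.io cluster -o jsonpath='{.spec.defaultNetwork.ovnKubernetesConfig.gatewayConfig.routingViaHost}'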

       

       

      Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

      Affected Platforms:

      Is it an

      1. internal CI failure
      2. customer issue / SD
      3. internal RedHat testing failure

      If it is an internal RedHat testing failure:

      • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

      If it is a CI failure:

      • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
      • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
      • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
      • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
      • If it's a connectivity issue,
      • What is the srcNode, srcIP and srcNamespace and srcPodName?
      • What is the dstNode, dstIP and dstNamespace and dstPodName?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

      If it is a customer / SD issue:

      • Provide enough information in the bug description that Engineering doesn't need to read the entire case history.
      • Don't presume that Engineering has access to Salesforce.
      • Do presume that Engineering will access attachments through supportshell.
      • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
      • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
        • If the issue is in a customer namespace then provide a namespace inspect.
        • If it is a connectivity issue:
          • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
          • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
          • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
          • Please provide the UTC timestamp networking outage window from must-gather
          • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
        • If it is not a connectivity issue:
          • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when problem has happened if any.
      • When showing the results from commands, include the entire command in the output.  
      • For OCPBUGS in which the issue has been identified, label with "sbr-triaged"
      • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with "sbr-untriaged"
      • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
      • Note: bugs that do not meet these minimum standards will be closed with label "SDN-Jira-template"
      • For guidance on using this template please see
        OCPBUGS Template Training for Networking components
