OpenShift Bugs / OCPBUGS-41339

[Pre-Merge-Testing] ExternalIP service cannot be accessed (UDPN/Layer3/SGW)


    • Important
    • No
    • SDN Sprint 262, SDN Sprint 263, SDN Sprint 264, SDN Sprint 265
    • 4
    • Rejected
    • False

      Description of problem:

      Version-Release number of selected component (if applicable):
      Build with openshift/ovn-kubernetes#2286, openshift/api#2005

      How reproducible:
      Always
      Steps to Reproduce:

      % oc get nodes -o wide
      NAME                                  STATUS   ROLES                  AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                KERNEL-VERSION                 CONTAINER-RUNTIME
      huirwang-0906a-b6zkg-master-0         Ready    control-plane,master   6h26m   v1.30.3   10.0.0.5      <none>        Red Hat Enterprise Linux CoreOS 417.94.202409032126-0   5.14.0-427.35.1.el9_4.x86_64   cri-o://1.30.5-2.rhaos4.17.gitdf27b8f.el9
      huirwang-0906a-b6zkg-master-1         Ready    control-plane,master   6h25m   v1.30.3   10.0.0.4      <none>        Red Hat Enterprise Linux CoreOS 417.94.202409032126-0   5.14.0-427.35.1.el9_4.x86_64   cri-o://1.30.5-2.rhaos4.17.gitdf27b8f.el9
      huirwang-0906a-b6zkg-master-2         Ready    control-plane,master   6h25m   v1.30.3   10.0.0.6      <none>        Red Hat Enterprise Linux CoreOS 417.94.202409032126-0   5.14.0-427.35.1.el9_4.x86_64   cri-o://1.30.5-2.rhaos4.17.gitdf27b8f.el9
      huirwang-0906a-b6zkg-worker-a-qvpxp   Ready    worker                 6h14m   v1.30.3   10.0.128.2    <none>        Red Hat Enterprise Linux CoreOS 417.94.202409032126-0   5.14.0-427.35.1.el9_4.x86_64   cri-o://1.30.5-2.rhaos4.17.gitdf27b8f.el9
      huirwang-0906a-b6zkg-worker-b-9r9ps   Ready    worker                 6h13m   v1.30.3   10.0.128.3    <none>        Red Hat Enterprise Linux CoreOS 417.94.202409032126-0   5.14.0-427.35.1.el9_4.x86_64   cri-o://1.30.5-2.rhaos4.17.gitdf27b8f.el9
      huirwang-0906a-b6zkg-worker-c-p5rgt   Ready    worker                 6h14m   v1.30.3   10.0.128.4    <none>        Red Hat Enterprise Linux CoreOS 417.94.202409032126-0   5.14.0-427.35.1.el9_4.x86_64   cri-o://1.30.5-2.rhaos4.17.gitdf27b8f.el9
      

      1. Edit Network.operator and add the externalIP policy below

      spec:
        externalIP:
          policy:
            allowedCIDRs:
            - 10.0.128.0/24
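
As a quick sanity check on the policy above, the following minimal sketch tests whether a service's external IP falls inside the configured `allowedCIDRs`. It uses only the values that appear in this report (`10.0.128.0/24` and the external IP `10.0.128.102` from step 4); the function name `external_ip_allowed` is hypothetical and is not part of any OpenShift tooling.

```python
import ipaddress


def external_ip_allowed(external_ip: str, allowed_cidrs: list[str]) -> bool:
    """Return True if external_ip falls inside any of the allowed CIDR blocks."""
    ip = ipaddress.ip_address(external_ip)
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Values taken from this report: the policy CIDR and the service's external IP.
ALLOWED_CIDRS = ["10.0.128.0/24"]
SERVICE_EXTERNAL_IP = "10.0.128.102"

if __name__ == "__main__":
    print(external_ip_allowed(SERVICE_EXTERNAL_IP, ALLOWED_CIDRS))
```

This only rules out a policy mismatch as the cause; it says nothing about whether traffic to the external IP is actually routed to the backend.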
      

      2. Create namespace ns1

      3. Create a Layer3 UserDefinedNetwork (UDN) CR
      % oc get UserDefinedNetwork -n ns1 -o yaml
      apiVersion: v1
      items:
      - apiVersion: k8s.ovn.org/v1
        kind: UserDefinedNetwork
        metadata:
          creationTimestamp: "2024-09-06T08:10:31Z"
          finalizers:
          - k8s.ovn.org/user-defined-network-protection
          generation: 1
          name: udn-network
          namespace: ns1
          resourceVersion: "143034"
          uid: d71f468a-32c0-419c-87c7-f39953a0c596
        spec:
          layer3:
            role: Primary
            subnets:
            - cidr: 10.200.0.0/16
              hostSubnet: 24
          topology: Layer3
        status:
          conditions:
          - lastTransitionTime: "2024-09-06T08:10:31Z"
            message: NetworkAttachmentDefinition has been created
            reason: NetworkAttachmentDefinitionReady
            status: "True"
            type: NetworkReady
      kind: List
      metadata:
        resourceVersion: ""
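
To make the subnet settings in this UDN concrete, here is a small sketch of the arithmetic. It assumes that `hostSubnet: 24` means each node is handed a /24 slice of the `10.200.0.0/16` range; that interpretation is an assumption, not something this report confirms. The helper names are hypothetical, and the listed subnets only illustrate the split, not the order in which OVN-Kubernetes would assign them.

```python
import ipaddress


def node_subnet_capacity(cidr: str, host_subnet: int) -> int:
    """Number of per-node subnets of prefix length host_subnet that fit in cidr."""
    network = ipaddress.ip_network(cidr)
    if host_subnet < network.prefixlen:
        raise ValueError("hostSubnet prefix must not be shorter than the network prefix")
    return 2 ** (host_subnet - network.prefixlen)


def first_node_subnets(cidr: str, host_subnet: int, count: int) -> list[str]:
    """First `count` candidate per-node subnets, in address order (illustrative only)."""
    network = ipaddress.ip_network(cidr)
    subnets = network.subnets(new_prefix=host_subnet)
    return [str(subnet) for _, subnet in zip(range(count), subnets)]


if __name__ == "__main__":
    print(node_subnet_capacity("10.200.0.0/16", 24))
    print(first_node_subnets("10.200.0.0/16", 24, 3))
```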

      4. Create a backend pod and an externalIP service

      % oc get pods -n ns1 -o wide
      NAME            READY   STATUS    RESTARTS   AGE   IP            NODE                                  NOMINATED NODE   READINESS GATES
      hello-pod       1/1     Running   0          24m   10.131.0.72   huirwang-0906a-b6zkg-worker-a-qvpxp   <none>           <none>
      % oc get svc -n ns1
      NAME        TYPE        CLUSTER-IP     EXTERNAL-IP    PORT(S)     AGE
      hello-pod   ClusterIP   172.30.80.61   10.0.128.102   27017/TCP   25m
      

      5. Access the externalIP service from a node other than huirwang-0906a-b6zkg-worker-a-qvpxp (the node hosting the backend pod)

      Actual results:
      The externalIP service cannot be accessed from a node other than the one hosting the backend pod.

      % oc debug node/huirwang-0906a-b6zkg-master-0 
      Starting pod/huirwang-0906a-b6zkg-master-0-debug-2lgj8 ...
      To use host binaries, run `chroot /host`
      Pod IP: 10.0.0.5
      If you don't see a command prompt, try pressing enter.
      sh-5.1# curl 10.0.128.102:27017 --connect-timeout 5
      curl: (28) Connection timed out after 5000 milliseconds
      sh-5.1# exit
      exit
      
      Removing debug pod ...
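
The failing check above can also be scripted. The sketch below is a rough stand-in for `curl --connect-timeout 5`: it only tests whether the TCP handshake to the external IP completes within the timeout and does not send an HTTP request. The function name `tcp_reachable` is hypothetical; the address and port are the ones from this report.

```python
import socket


def tcp_reachable(host: str, port: int, timeout_s: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False


if __name__ == "__main__":
    # External IP and port of the hello-pod service in namespace ns1.
    print(tcp_reachable("10.0.128.102", 27017))
```

Running it on several nodes would show which ones can reach the service, which is the comparison this bug is about.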
      

      Expected results:
      The externalIP service should be accessible from nodes other than the one hosting the backend pod.

      Additional info:
      The externalIP service is accessible from UDN pods:

      % oc rsh -n ns1 test-rc-29mm4  
      ~ $ curl 10.0.128.102:27017
      Hello OpenShift!
      


              mkennell@redhat.com Martin Kennelly
              huirwang Huiran Wang
              Huiran Wang Huiran Wang