OCPBUGS-48572: [BGP EIP on default network pre-merge testing] egressIPs (IP from same subnet or random IP) are not advertised


      Description of problem: [BGP EIP on default network pre-merge testing] egressIPs (IP from same subnet or random IP) are not advertised 

      Version-Release number of selected component (if applicable):

      How reproducible:

      Steps to Reproduce:

      1. With BGP enabled and an external FRR container created, apply the following receive_all.yaml and ra.yaml (example setup and apply commands follow the manifests below):

      [root@openshift-qe-026 configs]# cat receive_all.yaml
      apiVersion: frrk8s.metallb.io/v1beta1
      kind: FRRConfiguration
      metadata:
        name: receive-all
        namespace: openshift-frr-k8s
      spec:
        bgp:
          routers:
          - asn: 64512
            neighbors:
            - address: 192.168.111.1 
              asn: 64512
              toReceive:
                allowed:
                  mode: all


      [root@openshift-qe-026 jechen]# cat ra.yaml 
      apiVersion: k8s.ovn.org/v1
      kind: RouteAdvertisements
      metadata:
        name: default
      spec:
        networkSelector:
          matchLabels:
            k8s.ovn.org/default-network: ""
        advertisements:
        - "PodNetwork"
        - "EgressIP"


      2. Label a node as an egress node.
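
      For example, using the egress-assignable label that OVN-Kubernetes watches for (worker-2 matches the node the egressIPs land on later in this report):

      oc label node worker-2 k8s.ovn.org/egress-assignable=""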

      3. Create a namespace and add a label to it that matches the namespaceSelector of the EgressIP objects created in step 4.
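
      A sketch of this step; the namespace name and label are placeholders, not the ones used in the actual test run:

      oc create namespace test-eip
      oc label namespace test-eip org=qe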

      4. Create two EgressIP objects: one with an IP from the same subnet as the egress node, and another with a random IP that is not in use.
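
      Illustrative manifests; the names and IPs match the oc output below, while the namespaceSelector label is the placeholder from the previous step:

      apiVersion: k8s.ovn.org/v1
      kind: EgressIP
      metadata:
        name: egressip1
      spec:
        egressIPs:
        - 192.168.111.65          # IP from the egress node's subnet
        namespaceSelector:
          matchLabels:
            org: qe               # placeholder label from step 3
      ---
      apiVersion: k8s.ovn.org/v1
      kind: EgressIP
      metadata:
        name: egressip2
      spec:
        egressIPs:
        - 8.8.8.8                 # random IP outside the node subnet
        namespaceSelector:
          matchLabels:
            org: qe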

      5. Create some test pods in the namespace.
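
      For example (pod name and image are illustrative):

      oc -n test-eip run test-pod --image=registry.access.redhat.com/ubi9/ubi -- sleep infinity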

      6. Wait until the egressIPs are assigned to the egress node.

      Due to https://issues.redhat.com/browse/OCPBUGS-48326, the egressIP using the same-subnet IP is not assigned at first:

      [root@openshift-qe-026 jechen]# oc get egressips.k8s.ovn.org 
      NAME        EGRESSIPS        ASSIGNED NODE   ASSIGNED EGRESSIPS
      egressip1   192.168.111.65                   
      egressip2   8.8.8.8          worker-2        8.8.8.8

      However, after applying the workaround described in the comment section of OCPBUGS-48326, both egressIPs can be assigned to the egress node:


      [root@openshift-qe-026 jechen]# oc get egressips.k8s.ovn.org 
      NAME        EGRESSIPS        ASSIGNED NODE   ASSIGNED EGRESSIPS
      egressip1   192.168.111.65   worker-2        192.168.111.65
      egressip2   8.8.8.8          worker-2        8.8.8.8


      Actual results: neither of the two egressIPs is advertised:

      [root@openshift-qe-026 jechen]# sudo podman exec -it deb6437508fa /bin/sh
      / # vtysh
      % Can't open configuration file /etc/frr/vtysh.conf due to 'No such file or directory'.
      Configuration file[/etc/frr/frr.conf] processing failure: 11

      Hello, this is FRRouting (version 9.1.2_git).
      Copyright 1996-2005 Kunihiro Ishiguro, et al.

      openshift-qe-026.lab.eng.rdu2.redhat.com# show ip bgp
      BGP table version is 10, local router ID is 192.168.222.1, vrf id 0
      Default local pref 100, local AS 64512
      Status codes:  s suppressed, d damped, h history, * valid, > best, = multipath,
                     i internal, r RIB-failure, S Stale, R Removed
      Nexthop codes: @NNN nexthop's vrf id, < announce-nh-self
      Origin codes:  i - IGP, e - EGP, ? - incomplete
      RPKI validation codes: V valid, I invalid, N Not found

          Network          Next Hop            Metric LocPrf Weight Path
       *>i10.128.0.0/23    192.168.111.20           0    100      0 i
       *>i10.128.2.0/23    192.168.111.23           0    100      0 i
       *>i10.129.0.0/23    192.168.111.21           0    100      0 i
       *>i10.129.2.0/23    192.168.111.24           0    100      0 i
       *>i10.130.0.0/23    192.168.111.22           0    100      0 i
       *>i10.130.2.0/23    192.168.111.47           0    100      0 i
       *>i10.131.0.0/23    192.168.111.25           0    100      0 i
       *>i10.131.2.0/23    192.168.111.40           0    100      0 i
       *> 192.168.1.0/24   0.0.0.0                  0         32768 i
       *> 192.169.1.1/32   0.0.0.0                  0         32768 i


      [root@openshift-qe-026 jechen]# sudo podman exec -it deb6437508fa /bin/sh
      / # ip route show | grep bgp
      10.128.0.0/23 via 192.168.111.20 dev offloadbm proto bgp metric 20 
      10.128.2.0/23 via 192.168.111.23 dev offloadbm proto bgp metric 20 
      10.129.0.0/23 via 192.168.111.21 dev offloadbm proto bgp metric 20 
      10.129.2.0/23 via 192.168.111.24 dev offloadbm proto bgp metric 20 
      10.130.0.0/23 via 192.168.111.22 dev offloadbm proto bgp metric 20 
      10.130.2.0/23 via 192.168.111.47 dev offloadbm proto bgp metric 20 
      10.131.0.0/23 via 192.168.111.25 dev offloadbm proto bgp metric 20 
      10.131.2.0/23 via 192.168.111.40 dev offloadbm proto bgp metric 20 
      / # exit


      Expected results: both egressIPs should be advertised, or at a minimum the egressIP from the same subnet should be advertised.
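
      For contrast with the "show ip bgp" output above, an advertised egressIP would be expected to appear on the external FRR as a /32 host route via the egress node, roughly like this (illustrative, not captured output; the next hop would be worker-2's node IP):

       *>i192.168.111.65/32 <worker-2 node IP>          0    100      0 i
       *>i8.8.8.8/32        <worker-2 node IP>          0    100      0 i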

      Additional info:

      Please fill in the following template while reporting a bug and provide as much relevant information as possible. Doing so will give us the best chance to find a prompt resolution.

      Affected Platforms:

      Is it an

      1. internal CI failure
      2. customer issue / SD
      3. internal Red Hat testing failure

      If it is an internal Red Hat testing failure:

      • Please share a kubeconfig or creds to a live cluster for the assignee to debug/troubleshoot along with reproducer steps (especially if it's a telco use case like ICNI, secondary bridges or BM+kubevirt).

      If it is a CI failure:

      • Did it happen in different CI lanes? If so please provide links to multiple failures with the same error instance
      • Did it happen in both sdn and ovn jobs? If so please provide links to multiple failures with the same error instance
      • Did it happen in other platforms (e.g. aws, azure, gcp, baremetal etc) ? If so please provide links to multiple failures with the same error instance
      • When did the failure start happening? Please provide the UTC timestamp of the networking outage window from a sample failure run
      • If it's a connectivity issue,
      • What is the srcNode, srcIP and srcNamespace and srcPodName?
      • What is the dstNode, dstIP and dstNamespace and dstPodName?
      • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)

      If it is a customer / SD issue:

      • Provide enough information in the bug description that Engineering doesn't need to read the entire case history.
      • Don't presume that Engineering has access to Salesforce.
      • Do presume that Engineering will access attachments through supportshell.
      • Describe what each relevant attachment is intended to demonstrate (failed pods, log errors, OVS issues, etc).
      • Referring to the attached must-gather, sosreport or other attachment, please provide the following details:
        • If the issue is in a customer namespace then provide a namespace inspect.
        • If it is a connectivity issue:
          • What is the srcNode, srcNamespace, srcPodName and srcPodIP?
          • What is the dstNode, dstNamespace, dstPodName and dstPodIP?
          • What is the traffic path? (examples: pod2pod? pod2external?, pod2svc? pod2Node? etc)
          • Please provide the UTC timestamp networking outage window from must-gather
          • Please provide tcpdump pcaps taken during the outage filtered based on the above provided src/dst IPs
        • If it is not a connectivity issue:
          • Describe the steps taken so far to analyze the logs from networking components (cluster-network-operator, OVNK, SDN, openvswitch, ovs-configure etc) and the actual component where the issue was seen based on the attached must-gather. Please attach snippets of relevant logs around the window when the problem happened, if any.
      • When showing the results from commands, include the entire command in the output.
      • For OCPBUGS in which the issue has been identified, label with "sbr-triaged"
      • For OCPBUGS in which the issue has not been identified and needs Engineering help for root cause, label with "sbr-untriaged"
      • Do not set the priority, that is owned by Engineering and will be set when the bug is evaluated
      • Note: bugs that do not meet these minimum standards will be closed with label "SDN-Jira-template"
      • For guidance on using this template please see
        OCPBUGS Template Training for Networking components

              Assignee: Jaime Caamaño Ruiz (jcaamano@redhat.com)
              Reporter: Jean Chen (jechen@redhat.com)