Fast Datapath Product / FDP-2013

Upstream: Handle ovn-encap-ip duplicates.

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Normal
    • OVN
    • 3

      Please mark each item below with ( / ) if completed or ( x ) if incomplete:
      ( ) Unit test or integration test cases are written and pass successfully
      ( ) The upstream pull request is merged upstream and passes CI
    • rhel-9
    • rhel-net-ovn
    • OVN FDP Sprint 10, OVN FDP Sprint 11
    • 2

      This is tracking the upstream effort needed to deliver the solution to the bug described below.


      Configuring multiple nodes with the same ovn-encap-ip causes the OVS db (conf.db) to grow rapidly in size.

      Suppose multiple nodes are properly configured, each using a different local_ip, e.g. hv1 using ip1, hv2 using ip2, and hv3 using ip3.

      If one node (e.g. hv1) is then (re)configured with the same ovn-encap-ip as another node (e.g. hv2), i.e. on hv1: ovs-vsctl set open . external-ids:ovn-encap-ip=ip2, the current ovn-controller will:

      • Delete existing geneve tunnels (using ip1).
      • Recreate new tunnels (using ip2).
      • Create Encap with ip2 and chassis=hv1 in sb.

      This commit fails, as there is already an Encap with ip2 and chassis=hv2. On detecting the commit failure, ovn-controller will:

      • Delete existing geneve tunnels (using ip2).
      • Recreate new tunnels using ip1.

      In the next run, ovn-controller will again try to use ip2, deleting the existing tunnels (using ip1), recreating new ones (using ip2), and trying again to commit to the sb, and so on.
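      The retry loop above can be modeled with a short, self-contained sketch (plain Python, no OVN involved; all names such as Southbound, commit_encap, and controller_iteration are illustrative, not real OVN APIs). It shows that while the duplicate configuration is in place, every run churns the tunnels twice and never converges:

```python
# Toy model of the ovn-controller retry loop described above.
# All names (Southbound, Encap keys, etc.) are illustrative only.

class Southbound:
    def __init__(self):
        # (ip, tunnel type) -> owning chassis; hv2 already owns ip2.
        self.encaps = {("ip2", "geneve"): "hv2"}

    def commit_encap(self, ip, encap_type, chassis):
        owner = self.encaps.get((ip, encap_type))
        if owner is not None and owner != chassis:
            return False  # constraint violation: duplicate encap IP
        self.encaps[(ip, encap_type)] = chassis
        return True

def controller_iteration(sb, chassis, wanted_ip, current_ip, events):
    # ovn-controller deletes tunnels on the old IP and recreates them
    # on the wanted IP before trying to commit the Encap row.
    events.append(f"delete tunnels ({current_ip})")
    events.append(f"create tunnels ({wanted_ip})")
    if sb.commit_encap(wanted_ip, "geneve", chassis):
        return wanted_ip  # converged
    # Commit failed: roll the tunnels back to the previous IP.
    events.append(f"delete tunnels ({wanted_ip})")
    events.append(f"create tunnels ({current_ip})")
    return current_ip  # will retry on the next run

sb = Southbound()
events = []
ip = "ip1"
for _ in range(3):  # three ovn-controller runs
    ip = controller_iteration(sb, "hv1", "ip2", ip, events)

# Each run performs four tunnel operations and never converges.
assert ip == "ip1"
assert len(events) == 12
```

      Each failed run writes transaction history into conf.db, which is consistent with the rapid database growth observed below.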

      ovn-controller could detect that an Encap with ip2, the same type, and chassis != hv1 already exists in the sb. In that case it should not delete/recreate tunnels or try (and fail) to write the Encap to the sb, and should instead log a rate-limited error (e.g. at most one per 10 seconds).
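      A minimal sketch of the proposed check (plain Python with illustrative names; the actual fix would live in ovn-controller's C code, and encap_conflict and RateLimitedLogger are hypothetical helpers, not existing OVN functions):

```python
import time

class RateLimitedLogger:
    """Emit at most one message per `interval` seconds (e.g. 1 per 10 s)."""
    def __init__(self, interval=10.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last = None
        self.emitted = []

    def error(self, msg):
        now = self.clock()
        if self.last is None or now - self.last >= self.interval:
            self.emitted.append(msg)
            self.last = now

def encap_conflict(sb_encaps, ip, encap_type, chassis):
    """True if the SB already has an Encap with the same ip/type owned by
    a different chassis -- the condition under which ovn-controller
    should leave the existing tunnels alone and only log an error."""
    owner = sb_encaps.get((ip, encap_type))
    return owner is not None and owner != chassis

# Simulate a fast retry loop with a fake clock: 100 iterations over
# ~10 seconds produce exactly one error line instead of 100.
fake_now = [0.0]
log = RateLimitedLogger(interval=10.0, clock=lambda: fake_now[0])
sb_encaps = {("ip2", "geneve"): "hv2"}
for _ in range(100):
    if encap_conflict(sb_encaps, "ip2", "geneve", "hv1"):
        log.error("ovn-encap-ip ip2 already in use by chassis hv2")
    fake_now[0] += 0.1

assert len(log.emitted) == 1
```

      With such a guard in place, the tunnel delete/recreate cycle and the failing sb commits are skipped entirely while the misconfiguration persists.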

      This is a configuration issue, so it should be fixed by the CMS/user. However, while the issue is present, it causes high CPU usage on the SB, ovn-controller, and ovs. In addition, the OVS db (conf.db) grows quickly in size (a local test with a few chassis showed an increase of 1 MB per second).

      When the issue is resolved by the CMS (ovn-encap-ip corrected on hv1 or hv2), ovn-controller wakes up due to the ovs or sb change and handles it properly.

       Reproduction Steps: Provide detailed steps or scripts to replicate the issue.

      There might be simpler ways, but it can be easily reproduced using ovn-fake-multinode:

      Find local_ip for ovn-chassis-2: ip2=$(podman exec ovn-chassis-2 ovs-vsctl get open . external_ids:ovn-encap-ip)

      Use this ip in chassis-3: podman exec ovn-chassis-3 ovs-vsctl set open . external-ids:ovn-encap-ip=$ip2

       

       Expected Behavior: Describe what should happen under normal circumstances.

      ovn-controller should log a rate-limited error, and avoid looping through tunnel deletion/creation and commit failures to the sb.

       Observed Behavior: Explain what actually happens.

      conf.db on ovn-chassis-3 increases dramatically in size, commit failures appear in the sb, and ovn-controller deletes and recreates tunnels.

      Logs: If you collected logs please provide them (e.g. sos report, /var/log/openvswitch/* , testpmd console)

      See https://issues.redhat.com/browse/OSPRH-17558


              roriorde@redhat.com Rosemarie O'Riorden
              nstbot NST Bot
              OVN QE OVN QE (Inactive)
              OVN