Fast Datapath Product / FDP-2040

Test Coverage: Issues due to datapath binding recreation when upgrading to 25.09

    • Type: Task
    • Resolution: Obsolete
    • Priority: Blocker
    • ovn25.09

      ( ) The test coverage is aligned with the epic's acceptance criteria
    • rhel-9

      This task is tracking the test case writing activities to cover the bug described below.

       Problem Description:

      Since https://github.com/ovn-org/ovn/commit/6919992d8781, ovn-northd tries to recreate every SB.Datapath_Binding whose UUID doesn't match that of its NB counterpart (the router or switch UUID).

      Unfortunately there are some problems with this approach:

      1. The schema change is not backwards compatible: the "type" field is an enum and doesn't allow empty values, which causes issues for active-backup SB deployments.  See more details in the suggested schema change here:
      https://github.com/dceara/ovn/commit/cdd9c2656f84d16ba94c3a0447c3b3e520e3b30d

      2. The helper function that returns the NB UUID associated with an SB datapath binding assumes that if the "type" field of the binding is populated, the binding has already been converted to the "new-style".  That's not always true: datapath bindings are converted only on recompute of the "datapath-sync" I-P node.  That means that as long as the datapath-sync node manages to incrementally process changes, the helper function datapath_get_nb_uuid_and_type() returns incorrect NB UUID values.  This in turn has the undesired side effect that all records in the SB that reference the datapath, e.g. Multicast_Group, Logical_Flow, are recreated.

      A way to reproduce the problem is to start an OVN sandbox (on current main or the 25.09 branch) but use a set of NB/SB databases created with a 25.03 OVN version (see the attached DBs, generated with the ovn-setup.sh script plus manually added dynamic MAC bindings, IGMP groups and learned_route records):

      $ make -j4 sandbox SANDBOXFLAGS="--n-controllers 0 --nbdb-source=$PWD/nb-25.03.db --sbdb-source=$PWD/sb-25.03.db"
      # At this point the SB.datapath bindings have _NOT_ been recreated, they still have the old UUIDs but their type has been set:
      $ ovn-sbctl list datapath_binding 
      _uuid               : d130468c-823a-450a-ac85-90afb6b4291f
      external_ids        : {logical-switch="71119d3e-8efe-4d72-8145-9ee8aeab61ae", name=sw1}
      load_balancers      : []
      tunnel_key          : 2
      type                : logical-switch

      _uuid               : 7de63131-4129-4f1f-bbc5-54c2121a185d
      external_ids        : {logical-router="941f182d-d93b-4b73-a6ae-f661e81d06d2", name=lr0}
      load_balancers      : []
      tunnel_key          : 3
      type                : logical-router

      _uuid               : bc453546-ec57-4889-ba98-47a2b5803e08
      external_ids        : {logical-switch="89e0d65f-54e4-4032-b9da-2c1393fafc2f", name=sw0}
      load_balancers      : []
      tunnel_key          : 1
      type                : logical-switch
      
      # Add a new logical switch, this is incrementally processed by en-datapath-sync:
      $ ovn-nbctl ls-add ls2
      
      # Only the new switch datapath has its UUID synced:
      _uuid               : d130468c-823a-450a-ac85-90afb6b4291f
      external_ids        : {logical-switch="71119d3e-8efe-4d72-8145-9ee8aeab61ae", name=sw1}
      load_balancers      : []
      tunnel_key          : 2
      type                : logical-switch

      _uuid               : 7de63131-4129-4f1f-bbc5-54c2121a185d
      external_ids        : {logical-router="941f182d-d93b-4b73-a6ae-f661e81d06d2", name=lr0}
      load_balancers      : []
      tunnel_key          : 3
      type                : logical-router

      _uuid               : bc453546-ec57-4889-ba98-47a2b5803e08
      external_ids        : {logical-switch="89e0d65f-54e4-4032-b9da-2c1393fafc2f", name=sw0}
      load_balancers      : []
      tunnel_key          : 1
      type                : logical-switch

      _uuid               : 66e0fb43-92ca-45ce-9769-3de87748d2b9
      external_ids        : {logical-switch="66e0fb43-92ca-45ce-9769-3de87748d2b9", name=ls2}
      load_balancers      : []
      tunnel_key          : 4
      type                : logical-switch
      
      # The rest of the datapaths still have the OLD UUID.
      # This causes their logical flows to be regenerated, even though no change happened for them.
      # Focusing on flow "external_ids={source="northd.c:16642", stage-name=lr_in_unsnat}" for datapath 7de63131 (original lr0):
      $ ovsdb-tool show-log -mmm sandbox/sb1.db | grep -e record -e 'source="northd.c:16642"' -C 4
      [...]
      record 6: 2025-09-02 15:40:24.189 "ovn-northd"
        table Multicast_Group row "_MC_flood_l2" (3e4b8820) diff: 
          delete row
        table Multicast_Group row "_MC_flood_l2" (fd433812) diff: 
          delete row
      --
        table Logical_Flow insert row 91d31f4c:
          match="1"
          pipeline=ingress
          logical_datapath=7de63131-4129-4f1f-bbc5-54c2121a185d
          external_ids={source="northd.c:16642", stage-name=lr_in_unsnat}
          table_id=5
          actions="next;"
        table Logical_Flow insert row 5b7f5b02:
          match="1"
      --
          table_size=2048
          idle_timeout=300
          enabled=false
      
      record 7: 2025-09-02 15:40:24.239 "ovn-northd"
        table Multicast_Group insert row "_MC_flood_l2" (917c3a9f):
          ports=[fa9d276f-b412-4828-9c44-ebe32b411262]
          tunnel_key=32772
          name=_MC_flood_l2
      --
        table Logical_Flow insert row 8488e097:
          match="1"
          pipeline=ingress
          logical_datapath=7de63131-4129-4f1f-bbc5-54c2121a185d
          external_ids={source="northd.c:16642", stage-name=lr_in_unsnat}
          table_id=5
          actions="next;"
        table Logical_Flow insert row cf981467:
          match="ip4.src_mcast ||ip4.src == 255.255.255.255 || ip4.src == 127.0.0.0/8 || ip4.dst == 127.0.0.0/8 || ip4.src == 0.0.0.0/8 || ip4.dst == 0.0.0.0/8"
      
      

      This continues until the en-datapath-sync node recomputes, which can happen much later.
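
A test case could flag the unconverted bindings directly from the `ovn-sbctl list datapath_binding` output shown above. The following is a minimal sketch, assuming the output keeps the "column : value" layout with blank lines between records, as in this ticket. The function names (`parse_records`, `nb_uuid_from_external_ids`, `unsynced_bindings`) are hypothetical helpers written for illustration and are not part of OVN.

```python
import re

# Sample taken from the transcript in this ticket (sw1 is old-style, ls2 is new-style).
SAMPLE = """\
_uuid               : d130468c-823a-450a-ac85-90afb6b4291f
external_ids        : {logical-switch="71119d3e-8efe-4d72-8145-9ee8aeab61ae", name=sw1}
load_balancers      : []
tunnel_key          : 2
type                : logical-switch

_uuid               : 66e0fb43-92ca-45ce-9769-3de87748d2b9
external_ids        : {logical-switch="66e0fb43-92ca-45ce-9769-3de87748d2b9", name=ls2}
load_balancers      : []
tunnel_key          : 4
type                : logical-switch
"""


def parse_records(text):
    """Split `ovn-sbctl list` style output into a list of {column: value} dicts."""
    records = []
    for block in re.split(r"\n\s*\n", text.strip()):
        record = {}
        for line in block.splitlines():
            # Only the first ':' separates the column name from its value.
            column, sep, value = line.partition(":")
            if sep:
                record[column.strip()] = value.strip()
        if record:
            records.append(record)
    return records


def nb_uuid_from_external_ids(external_ids):
    """Return the NB UUID stored under logical-switch/logical-router, or None."""
    match = re.search(
        r'(?:logical-switch|logical-router)="?([0-9a-f-]{36})"?', external_ids
    )
    return match.group(1) if match else None


def unsynced_bindings(text):
    """Return the _uuid of every binding whose UUID differs from its NB UUID."""
    unsynced = []
    for record in parse_records(text):
        nb_uuid = nb_uuid_from_external_ids(record.get("external_ids", ""))
        if nb_uuid is not None and record.get("_uuid") != nb_uuid:
            unsynced.append(record.get("_uuid"))
    return unsynced


if __name__ == "__main__":
    print(unsynced_bindings(SAMPLE))
```

In a test, a non-empty result after an incrementally processed NB change would indicate that old-style bindings are still present even though their "type" column is set.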

      3. All "dynamic" SB records (MAC_Binding, IGMP_Group, Learned_Route) are invalidated for old-style datapaths and ovn-northd removes them.  This can severely disrupt traffic.  Using the same procedure as above, start a sandbox with the same DBs and check the SB.MAC_Binding, SB.IGMP_Group and SB.Learned_Route tables.  They're empty after the upgrade because their datapath pointer is seen as stale, due to the incorrect behavior of the datapath_get_nb_uuid_and_type() function.

      However, even if that behavior is fixed, a recompute of the en-datapath-sync node will cause the old-style datapaths to be recreated.  Without additional work, the "old" MAC_Binding, IGMP_Group and Learned_Route records will still be removed by northd, as they appear stale because they reference SB Datapath_Bindings that have been recreated.
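
How issues 2 and 3 interact can be illustrated with a toy model. This is a sketch under stated assumptions, not the northd implementation: bindings and dynamic records are plain dicts, `nb_uuid_assuming_converted` only mimics the assumption described for datapath_get_nb_uuid_and_type(), and `nb_uuid_checked` and `stale_dynamic_records` are hypothetical names. The staleness rule (a record is stale when its datapath does not resolve to a known NB UUID) is also an assumption made for illustration.

```python
NB_KEYS = ("logical-switch", "logical-router")


def nb_uuid_from_external_ids(binding):
    """Return the NB UUID recorded in external_ids, or None if absent."""
    for key in NB_KEYS:
        if key in binding["external_ids"]:
            return binding["external_ids"][key]
    return None


def nb_uuid_assuming_converted(binding):
    """Mimic the described assumption: a populated "type" means the binding
    is already new-style, so its own UUID is taken as the NB UUID."""
    if binding.get("type"):
        return binding["uuid"]
    return nb_uuid_from_external_ids(binding)


def nb_uuid_checked(binding):
    """Hypothetical alternative: prefer the NB UUID recorded in external_ids
    and fall back to the binding UUID only when none is recorded."""
    return nb_uuid_from_external_ids(binding) or binding["uuid"]


def stale_dynamic_records(records, bindings, nb_uuids, resolve):
    """Return names of records whose datapath does not resolve to a known
    NB UUID, or whose datapath binding no longer exists."""
    by_uuid = {binding["uuid"]: binding for binding in bindings}
    stale = []
    for record in records:
        binding = by_uuid.get(record["datapath"])
        if binding is None or resolve(binding) not in nb_uuids:
            stale.append(record["name"])
    return stale


if __name__ == "__main__":
    # lr0 as it looks right after the upgrade: old UUID, but "type" already set.
    lr0 = {
        "uuid": "7de63131-4129-4f1f-bbc5-54c2121a185d",
        "type": "logical-router",
        "external_ids": {
            "logical-router": "941f182d-d93b-4b73-a6ae-f661e81d06d2",
            "name": "lr0",
        },
    }
    nb_uuids = {"941f182d-d93b-4b73-a6ae-f661e81d06d2"}
    records = [{"name": "mac-binding-1", "datapath": lr0["uuid"]}]
    print(stale_dynamic_records(records, [lr0], nb_uuids, nb_uuid_assuming_converted))
    print(stale_dynamic_records(records, [lr0], nb_uuids, nb_uuid_checked))
```

In this model the checked resolver keeps the record alive only as long as the old binding exists. Once the binding is recreated under the NB UUID, the record's datapath reference no longer matches any binding and it is reported as stale again, which corresponds to the second paragraph of issue 3.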

      A few potential solutions for these issues have been discussed upstream: 

      https://mail.openvswitch.org/pipermail/ovs-dev/2025-September/425925.html

      Out of these only two seem to be viable:

      • option "b" (all of the following changes, i.e., patches 1-2 from this series and the follow-up to update other SB records):
        • update the SB.Datapath_Binding schema making "type" optional (issue 1)
        • immediately recreate SB.Datapath_Binding records transforming "old-style" records to "new-style" records
        • update all SB records that reference old style datapaths to reference the newly created new-style datapaths
      • option "d" (go back to the proposed schema from v15 of the I-P refactor)
        • proposed by Mark here
        • we still need to ensure that the newly added SB.Datapath_Binding field allows an empty value to avoid issue "1" above
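
The recreate-and-repoint steps of option "b" can be sketched on in-memory data. This is a rough illustration only: the dict layout, `nb_identity` and `migrate` are hypothetical, and the sketch assumes that a recreated binding keeps all of its other columns (for example tunnel_key), which the discussion above does not spell out.

```python
NB_KEYS = ("logical-switch", "logical-router")


def nb_identity(binding):
    """Return (nb_uuid, datapath_type) derived from external_ids, or (None, None)."""
    for key in NB_KEYS:
        if key in binding["external_ids"]:
            return binding["external_ids"][key], key
    return None, None


def migrate(bindings, referencing_rows):
    """Recreate old-style bindings under their NB UUID and repoint referencing rows.

    Returns (new_bindings, new_rows); the inputs are left unmodified.
    """
    remap = {}  # old binding UUID -> new binding UUID
    new_bindings = []
    for binding in bindings:
        nb_uuid, dp_type = nb_identity(binding)
        if nb_uuid is None or nb_uuid == binding["uuid"]:
            # Already new-style (or nothing to derive the NB UUID from).
            new_bindings.append(binding)
            continue
        remap[binding["uuid"]] = nb_uuid
        new_bindings.append({**binding, "uuid": nb_uuid, "type": dp_type})
    new_rows = [
        {**row, "datapath": remap.get(row["datapath"], row["datapath"])}
        for row in referencing_rows
    ]
    return new_bindings, new_rows


if __name__ == "__main__":
    sw1 = {
        "uuid": "d130468c-823a-450a-ac85-90afb6b4291f",
        "external_ids": {
            "logical-switch": "71119d3e-8efe-4d72-8145-9ee8aeab61ae",
            "name": "sw1",
        },
        "tunnel_key": 2,
        "type": "logical-switch",
    }
    rows = [{"table": "MAC_Binding", "datapath": sw1["uuid"]}]
    print(migrate([sw1], rows))
```

The point of doing both steps together is that rows such as MAC_Binding, IGMP_Group and Learned_Route never reference a binding that has disappeared, so they would not look stale to a later cleanup pass.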

       

       Impact Assessment:

      Traffic disruption and control plane churn.

       Software Versions:

      25.09.0

        Issue Type:

      Regression.

       Reproducibility:

      Always.

       Reproduction Steps:

      See description.

       Expected Behavior:

      Upgrades should not create downtime.

       Observed Behavior:

      Logical flows, multicast groups, etc. are potentially recreated continuously, for an indefinite time.  Dynamic SB records (MAC_Binding, IGMP_Group, Learned_Route) are removed on upgrade, causing traffic disruption.
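
For test coverage, the flow churn could be detected from the `ovsdb-tool show-log -mmm` output by looking for Logical_Flow rows with identical content that are inserted more than once. The sketch below assumes the log layout shown in the description ("table Logical_Flow insert row ...:" followed by "column=value" lines); `logical_flow_inserts` and `recreated_flows` are hypothetical helper names, and rows truncated by `grep -C` are skipped.

```python
import re
from collections import Counter

INSERT_RE = re.compile(r"^\s*table (\S+) insert row (.+):\s*$")
FIELD_RE = re.compile(r"^\s*(\w+)=(.*)$")
KEY_FIELDS = ("logical_datapath", "pipeline", "table_id", "match", "actions")


def logical_flow_inserts(log_text):
    """Return one {column: value} dict per inserted Logical_Flow row."""
    rows, current = [], None
    for line in log_text.splitlines():
        insert = INSERT_RE.match(line)
        field = FIELD_RE.match(line)
        if insert:
            # Start collecting columns only for Logical_Flow inserts.
            current = {} if insert.group(1) == "Logical_Flow" else None
            if current is not None:
                rows.append(current)
        elif field and current is not None:
            current[field.group(1)] = field.group(2).strip()
        else:
            # Any other line (record header, "--", diff, blank) ends the row.
            current = None
    return rows


def recreated_flows(log_text):
    """Return {flow key: insert count} for flows inserted more than once."""
    counts = Counter(
        tuple(row[name] for name in KEY_FIELDS)
        for row in logical_flow_inserts(log_text)
        if all(name in row for name in KEY_FIELDS)
    )
    return {key: count for key, count in counts.items() if count > 1}


if __name__ == "__main__":
    import sys

    # Usage: ovsdb-tool show-log -mmm sandbox/sb1.db | python3 this_script.py
    for key, count in recreated_flows(sys.stdin.read()).items():
        print(count, key)
```

Identical content inserted repeatedly is only a heuristic for recreation; a flow that is legitimately deleted and re-added with the same columns would be reported as well.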

       

              ovnteam@redhat.com OVN Team
              nstbot NST Bot
              OVN QE OVN QE