  1. Fast Datapath Product
  2. FDP-2839

CLONE [ovn24.09 fast-datapath-rhel-9] - Upstream: [ovn-controller] assertion failure due to trying to write to a deleted IDL record


      Please mark each item below with ( / ) if completed or ( x ) if incomplete:
      ( ) Unit tests or integration test cases are written and pass successfully
      ( ) The upstream pull request is merged upstream and passes CI

    • ovn24.09-24.09.3-83.el9fdp
    • rhel-9
    • None
    • rhel-net-ovn

      This is tracking the upstream effort needed to deliver the solution to the bug described below.


       Problem Description: Clearly explain the issue.

      With ovn25.09-25.09.1-11.el9fdp, ovn-controller hits an assertion failure:

      #5  vlog_abort (module=0x55cecc91d9c0 <this_module.lto_priv>, message=0x55cecc8b4290 "%s: assertion %s failed in %s()") at ovs-852f07e5251c6a0c0d5c43dc980d12a4f1bcd370/lib/vlog.c:1325
      #6  0x000055cecc861239 in ovs_assert_failure (where=<optimized out>, function=<optimized out>, condition=<optimized out>) at ovs-852f07e5251c6a0c0d5c43dc980d12a4f1bcd370/lib/util.c:90
      #7  0x000055cecc85006a in ovsdb_idl_txn_write__.constprop.0 (row_=0x55cee9440cc0, column=0x55cecc8f8e98 <sbrec_port_binding_columns+2520>, datum=0x7ffe8c1af020, owns_datum=true)
          at ovs-852f07e5251c6a0c0d5c43dc980d12a4f1bcd370/lib/ovsdb-idl.c:3650
      #8  0x000055cecc759b99 in ovsdb_idl_txn_write (row=0x55cee9440cc0, column=<optimized out>, datum=0x7ffe8c1af020) at ovs-852f07e5251c6a0c0d5c43dc980d12a4f1bcd370/lib/ovsdb-idl.c:3742
      #9  sbrec_port_binding_set_up (n_up=1, row=<optimized out>, up=<synthetic pointer>) at lib/ovn-sb-idl.c:39665
      #10 port_binding_set_down (chassis_rec=<optimized out>, pb_table=0x55cee8893a10, iface_id=<optimized out>, pb_uuid=0x55cee96dee40) at controller/binding.c:3700
      #11 if_status_mgr_update (mgr=<optimized out>, binding_data=<optimized out>, chassis_rec=<optimized out>, iface_table=<optimized out>, pb_table=<optimized out>, ovs_readonly=<optimized out>, sb_readonly=<optimized out>)
          at controller/if-status.c:645
      #12 0x000055cecc747916 in main (argc=<optimized out>, argv=<optimized out>) at controller/ovn-controller.c:7544
      

      while trying to write to a deleted record:

      #7  0x000055cecc85006a in ovsdb_idl_txn_write__.constprop.0 (row_=0x55cee9440cc0, column=0x55cecc8f8e98 <sbrec_port_binding_columns+2520>, datum=0x7ffe8c1af020, owns_datum=true)
          at ovs-852f07e5251c6a0c0d5c43dc980d12a4f1bcd370/lib/ovsdb-idl.c:3650
      3650        ovs_assert(row->new_datum != NULL);
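
      To make the failing condition concrete, here is a minimal, self-contained sketch. It does not use the real OVS IDL API: `struct toy_row`, `toy_row_write()` and `toy_row_try_write()` are hypothetical names, and the only detail taken from the backtrace is the asserted condition `row->new_datum != NULL`. The assumption is that a row whose `new_datum` is NULL represents a deleted record, as the description above suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an IDL row. This is NOT the real struct ovsdb_idl_row;
 * only the field name 'new_datum' is borrowed from the backtrace. */
struct toy_row {
    int *new_datum;   /* Assumption: NULL models a deleted record. */
};

/* Unguarded write: asserts the same condition as ovsdb-idl.c:3650,
 * so calling it on a deleted row aborts the process. */
static void
toy_row_write(struct toy_row *row, int value)
{
    assert(row->new_datum != NULL);
    *row->new_datum = value;
}

/* Hypothetical guarded variant: refuses to write to a missing or deleted
 * row and reports that to the caller instead of aborting. */
static bool
toy_row_try_write(struct toy_row *row, int value)
{
    if (row == NULL || row->new_datum == NULL) {
        return false;
    }
    toy_row_write(row, value);
    return true;
}
```

      In this toy model, a caller that holds a stale row pointer gets `false` back from `toy_row_try_write()` where the unguarded `toy_row_write()` would abort.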
      

      Originally hit in upstream ovn-kubernetes:
      https://github.com/ovn-kubernetes/ovn-kubernetes/actions/runs/19871655603/job/56951708396?pr=5764#step:16:18186

      This was with ovn-25.09.0-42.fc42.x86_64 on Fedora 42.

      However, the code is identical on the RHEL 25.09 branch, so it should crash in the same way.

       Impact Assessment: Describe the severity and impact (e.g., network down, availability of a workaround, etc.).

      Control plane crash: ovn-controller aborts on the assertion failure.
       

       Software Versions: Specify the exact versions in use (e.g., openvswitch3.1-3.1.0-147.el8fdp).

      ovn25.09-25.09.1-11.el9fdp
       

        Issue Type: Indicate whether this is a new issue or a regression (if a regression, state the last known working version).

      Already existing issue (not a regression).
       

       Reproducibility: Confirm if the issue can be reproduced consistently. If not, describe how often it occurs.

      So far it has been hit in ovn-kubernetes upstream CI, but it should be possible to reproduce the scenario with plain OVN commands.
       

       Reproduction Steps: Provide detailed steps or scripts to replicate the issue.

       

       Expected Behavior: Describe what should happen under normal circumstances.

      ovn-controller should not try to write to deleted IDL records.
       

       Observed Behavior: Explain what actually happens.

      imaximet@redhat.com observed that ovsdb_idl_get_row_for_uuid() may return IDL records that are deleted or about to be deleted. A potential fix might be to return only rows that are still in the database table (and not marked for deletion). Ilya shared a tentative fix, which we're testing here:
      https://github.com/dceara/ovn/commits/refs/heads/branch-25.09-northd-idl-crashes/
      https://github.com/ovn-kubernetes/ovn-kubernetes/pull/5772
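
      The idea of returning only rows that are still in the table can be sketched as follows. This is an illustration under stated assumptions, not the tentative fix linked above and not the real ovsdb_idl_get_row_for_uuid(): `struct toy_uuid`, `struct toy_row`, the `pending_delete` flag and `toy_get_row_for_uuid()` are all hypothetical, and the table is modeled as a plain array.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy 128-bit identifier; hypothetical, not the OVS struct uuid. */
struct toy_uuid {
    unsigned int parts[4];
};

/* Toy table row. 'pending_delete' is a hypothetical flag that models
 * "marked for deletion but still reachable by UUID". */
struct toy_row {
    struct toy_uuid uuid;
    bool pending_delete;
};

static bool
toy_uuid_equals(const struct toy_uuid *a, const struct toy_uuid *b)
{
    return memcmp(a->parts, b->parts, sizeof a->parts) == 0;
}

/* Looks up 'uuid' in 'rows'. Returns NULL both when no row matches and
 * when the matching row is marked for deletion, so callers never receive
 * a row they must not write to. */
static struct toy_row *
toy_get_row_for_uuid(struct toy_row *rows, size_t n_rows,
                     const struct toy_uuid *uuid)
{
    assert(rows != NULL || n_rows == 0);
    for (size_t i = 0; i < n_rows; i++) {
        if (toy_uuid_equals(&rows[i].uuid, uuid)) {
            return rows[i].pending_delete ? NULL : &rows[i];
        }
    }
    return NULL;
}
```

      Filtering in the lookup keeps the check in one place, so a caller that resolves a port binding by UUID only needs its existing NULL check.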
       

       Troubleshooting Actions: Outline the steps taken to diagnose or resolve the issue so far.

       

       Logs: If you collected logs please provide them (e.g. sos report, /var/log/openvswitch/* , testpmd console)


              ovn-qe OVN QE (Inactive)
              ovnteam@redhat.com OVN Team
              OVN QE OVN QE (Inactive)