Fast Datapath Product / FDP-2805

Segfault in synced logical datapath handler

    • Bug
    • Resolution: Done
    • ovn25.09
    • rhel-9
    • rhel-net-ovn
    • ssg_networking
       Problem Description: Clearly explain the issue.

      Using a reproducer script for issue FDP-2780, I managed to trigger a different crash in ovn-northd, this time in en_datapath_synced_logical_switch_datapath_sync_handler(). It likely affects logical routers as well.
       

   Impact Assessment: Describe the severity and impact (e.g., network down, availability of a workaround, etc.).

      This crashes ovn-northd.

       

   Software Versions: Specify the exact versions in use (e.g., openvswitch3.1-3.1.0-147.el8fdp).

      This happened on the HEAD of branch-25.09 as of 2 December 2025, specifically commit 1fa36ec73f05252db1a0877a960918e004fda07a. This likely does not affect earlier OVN streams.
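
      For reference, one way to check out the exact commit under test (assuming a clone of the upstream OVN repository with branch-25.09 available):

      git fetch origin branch-25.09
      git checkout 1fa36ec73f05252db1a0877a960918e004fda07a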

       

        Issue Type: Indicate whether this is a new issue or a regression (if a regression, state the last known working version).

      New issue.

       

       Reproducibility: Confirm if the issue can be reproduced consistently. If not, describe how often it occurs.

      I have a reproducer script that I wrote for the assertion reported in FDP-2780. The script sometimes triggers that assertion, but other times it crashes ovn-northd with the segfault described in this issue. I do not know how likely it is to hit this in the wild.

       

       Reproduction Steps: Provide detailed steps or scripts to replicate the issue.

      The following reproducer script causes the crash:

      #!/bin/bash
      
      set -e
      
      while true; do
          ovn-nbctl ls-add ls1
          ovn-nbctl lb-add lb1 192.168.0.1 10.0.0.1
          lb_uuid=$(ovn-nbctl --bare --columns=_uuid find load_balancer name=lb1)
      
          # Pause ovn-northd so that it does not receive IDL updates while we
          # perform the next operations.
      
          echo "northd going to sleep"
          kill -STOP $(cat sandbox/ovn-northd.pid)
          uuid=$(uuidgen)
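          # Create, destroy, and re-create a load_balancer_group with the
          # same UUID while ovn-northd is paused, so that northd sees the
          # net effect of all of these operations in a single IDL update.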
          ovn-nbctl --id=$uuid create load_balancer_group name=lbg1 load_balancer=$lb_uuid
          ovn-nbctl destroy load_balancer_group $uuid
          ovn-nbctl --id=$uuid create load_balancer_group name=lbg1 load_balancer=$lb_uuid
          ovn-nbctl set logical_switch ls1 load_balancer_group=$uuid
      
          # Now wake ovn-northd up and see if it asserts or crashes.
          echo "northd waking up"
          kill -CONT $(cat sandbox/ovn-northd.pid)
      
          ovn-nbctl ls-del ls1
          ovn-nbctl lb-del lb1
          ovn-nbctl --all destroy load_balancer_group
      done
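
      For context, the script assumes it is run from the root of an OVN source tree with a sandbox running (hence the sandbox/ovn-northd.pid path). A sketch of the setup, assuming the in-tree sandbox target and that the script above is saved as reproducer.sh (a name chosen here for illustration):

      # Terminal 1: build and start an OVN sandbox from the source tree.
      make sandbox

      # Terminal 2: run the reproducer from the same tree root. It loops
      # until ovn-northd dies, at which point the kill command fails and
      # "set -e" stops the script.
      ./reproducer.sh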
      

       

       Expected Behavior: Describe what should happen under normal circumstances.

      ovn-northd should not crash.
       

       Observed Behavior: Explain what actually happens.

      ovn-northd crashes when processing the synced datapath for the newly added logical switch. Specifically, sdp->nb_row appears to point to freed or otherwise junk memory: dereferencing sdp->nb_row->table->class_ segfaults. Notably, the garbage table pointer 0x2d333831612d3033 is ASCII text; read as little-endian bytes it spells "30-a183-", which looks like a fragment of a UUID string and hints that the row's memory was freed and reused.
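
      One way to double-check that ASCII interpretation against a core file is to dump the pointer's bytes as characters, e.g. (not run here; the actual bytes are whatever the core contains):

      (gdb) x/8c &sdp->nb_row->table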

       

       Troubleshooting Actions: Outline the steps taken to diagnose or resolve the issue so far.

      I reported this as soon as I realized the reproducer causes the crash. It will be up to the assignee to determine why the crash occurs. I originally suspected that the reproducer's quick deletion and re-adding of the logical switch might cause some confusing IDL messages. However, since the logical switch has a different UUID each time it is re-added, I think this is unlikely. The sketch below shows the general shape of the dangling-pointer bug the evidence suggests.
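
      For illustration only, here is a minimal generic C sketch (not OVN code; all struct names are stand-ins) of the kind of dangling-pointer bug the gdb output below suggests: a long-lived struct caches a raw pointer to a row, the row is freed without the cache being invalidated, and the allocator recycles the block for UUID-like string bytes, so reading the cached row's fields yields ASCII garbage:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      struct table;                                 /* opaque, like ovsdb_idl_table */
      struct row { const struct table *table; };    /* stands in for ovsdb_idl_row */
      struct cached { const struct row *nb_row; };  /* stands in for ovn_synced_datapath */

      int main(void)
      {
          struct row *r = calloc(1, sizeof *r);
          struct cached c = { .nb_row = r };  /* cache keeps a raw pointer */

          free(r);                            /* row deleted; cache not updated */
          char *s = malloc(sizeof *r);        /* allocator may recycle the block... */
          memcpy(s, "30-a183-", 8);           /* ...for UUID-string bytes */

          /* With a typical allocator the freed block is reused, so
           * c.nb_row->table now reads those string bytes back as a
           * pointer, much like the 0x2d333831612d3033 value in the
           * backtrace below. Dereferencing it is what segfaults. */
          printf("%p\n", (void *) c.nb_row->table);

          free(s);
          return 0;
      }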

       

   Logs: If you collected logs, please provide them (e.g., sos report, /var/log/openvswitch/*, testpmd console).

      The reproducer can be used to get a core file. In the meantime, here is a quick backtrace from the first time I hit the issue:

      (gdb) bt
      #0  0x000000000043294b in en_datapath_synced_logical_switch_run (node=<optimized out>, data=0x2730a0f0) at northd/en-datapath-logical-switch.c:297
      #1  0x000000000045b148 in engine_recompute (node=node@entry=0x733ee0 <en_datapath_synced_logical_switch>, allowed=allowed@entry=true, reason_fmt=reason_fmt@entry=0x61d2c4 "failed handler for input %s") at lib/inc-proc-eng.c:443
      #2  0x000000000045bc8c in engine_compute (node=<optimized out>, recompute_allowed=<optimized out>) at lib/inc-proc-eng.c:486
      #3  engine_run_node (node=0x733ee0 <en_datapath_synced_logical_switch>, recompute_allowed=<optimized out>) at lib/inc-proc-eng.c:545
      #4  engine_run (recompute_allowed=recompute_allowed@entry=true) at lib/inc-proc-eng.c:571
      #5  0x000000000044d98b in inc_proc_northd_run (ovnnb_txn=ovnnb_txn@entry=0x2743c420, ovnsb_txn=ovnsb_txn@entry=0x2740d790, ctx=ctx@entry=0x7fffc314ad40) at northd/inc-proc-northd.c:580
      #6  0x00000000004048b6 in main (argc=<optimized out>, argv=<optimized out>) at northd/ovn-northd.c:1096
      (gdb) list
      292	    synced_logical_switch_map_destroy(switch_map);
      293	    synced_logical_switch_map_init(switch_map);
      294	
      295	    struct ovn_synced_datapath *sdp;
      296	    HMAP_FOR_EACH (sdp, hmap_node, &dps->synced_dps) {
      297	        if (sdp->nb_row->table->class_ != &nbrec_table_logical_switch) {
      298	            continue;
      299	        }
      300	        struct ovn_synced_logical_switch *lsw =
      301	            synced_logical_switch_alloc(sdp);
      (gdb) p sdp
      $1 = (struct ovn_synced_datapath *) 0x27430450
      (gdb) p sdp->nb_row
      $2 = (const struct ovsdb_idl_row *) 0x273fc9f0
      (gdb) p sdp->nb_row->table
      $3 = (struct ovsdb_idl_table *) 0x2d333831612d3033
      (gdb) p sdp->nb_row->table->class_
      Cannot access memory at address 0x2d333831612d3033
      (gdb) p sdp->nb_row->table
      $4 = (struct ovsdb_idl_table *) 0x2d333831612d3033
      (gdb) p *sdp->nb_row->table
      Cannot access memory at address 0x2d333831612d3033
      (gdb) p *sdp->nb_row
      $5 = {hmap_node = {hash = 658485756, next = 0x27422180}, uuid = {parts = {658763392, 0, 658827088, 0}}, src_arcs = {prev = 0x273fca10, next = 0x31}, dst_arcs = {prev = 0x273db95c, next = 0x30342d373331322d}, table = 0x2d333831612d3033, old_datum = 0x3330633137333563, 
        persist_uuid = 97, parsed = 97, reparse_node = {prev = 0x31, next = 0x273db98c}, new_datum = 0x0, prereqs = 0x273fca80, written = 0x0, txn_node = {hash = 0, next = 0x31}, map_op_written = 0x273db8cc, map_op_lists = 0x0, set_op_written = 0x0, set_op_lists = 0x0, 
        change_seqno = {0, 0, 49}, track_node = {prev = 0x273db9bc, next = 0x0}, updated = 0x273fca20, tracked_old_datum = 0x273fca50}
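
      For completeness, if the reproducer is re-run to capture a full core, one way to open it (assuming systemd-coredump is collecting cores; adjust for your core_pattern setup otherwise):

      coredumpctl gdb ovn-northd
      (gdb) bt full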
      

        ovnteam@redhat.com (OVN Team)
        mmichelson (Mark Michelson)
        OVN QE (Inactive)