Fast Datapath Product / FDP-2805

Segfault in synced logical datapath handler

    • Bug
    • Resolution: Done
    • ovn25.09
    • rhel-9
    • rhel-net-ovn
    • ssg_networking
       Problem Description: Clearly explain the issue.

      Using a reproducer script for issue FDP-2780, I managed to trigger a different crash in ovn-northd, this time in en_datapath_synced_logical_switch_datapath_sync_handler(). It likely affects logical routers as well.
       

   Impact Assessment: Describe the severity and impact (e.g., network down, availability of a workaround, etc.).

      This crashes ovn-northd.

       

   Software Versions: Specify the exact versions in use (e.g., openvswitch3.1-3.1.0-147.el8fdp).

      This happened on the HEAD of branch-25.09 as of 2 December 2025, specifically commit 1fa36ec73f05252db1a0877a960918e004fda07a. This likely does not affect earlier OVN streams.
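
      For reference, one way to check out the exact commit under test (assuming a clone of the upstream OVN repository with branch-25.09 available):

      git fetch origin branch-25.09
      git checkout 1fa36ec73f05252db1a0877a960918e004fda07a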

       

        Issue Type: Indicate whether this is a new issue or a regression (if a regression, state the last known working version).

      New issue.

       

       Reproducibility: Confirm if the issue can be reproduced consistently. If not, describe how often it occurs.

      I have a reproducer script that I wrote for the assertion reported in FDP-2780. The script sometimes triggers that assertion, but other times it crashes ovn-northd with the segfault described in this issue. I do not know how likely it is to hit this in the wild.

       

       Reproduction Steps: Provide detailed steps or scripts to replicate the issue.

      The following reproducer script causes the crash:

      #!/bin/bash
      
      set -e
      
      while true; do
          ovn-nbctl ls-add ls1
          ovn-nbctl lb-add lb1 192.168.0.1 10.0.0.1
          lb_uuid=$(ovn-nbctl --bare --columns=_uuid find load_balancer name=lb1)
      
          # Pause ovn-northd so that it does not receive IDL updates while we
          # perform the next operations.
      
          echo "northd going to sleep"
          kill -STOP $(cat sandbox/ovn-northd.pid)
          uuid=$(uuidgen)
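          # Create, destroy, and re-create a load_balancer_group with the
          # same UUID while ovn-northd is paused, so that northd sees the
          # net effect of all of these operations in a single IDL update.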
          ovn-nbctl --id=$uuid create load_balancer_group name=lbg1 load_balancer=$lb_uuid
          ovn-nbctl destroy load_balancer_group $uuid
          ovn-nbctl --id=$uuid create load_balancer_group name=lbg1 load_balancer=$lb_uuid
          ovn-nbctl set logical_switch ls1 load_balancer_group=$uuid
      
          # Now wake ovn-northd up and see if it asserts or crashes.
          echo "northd waking up"
          kill -CONT $(cat sandbox/ovn-northd.pid)
      
          ovn-nbctl ls-del ls1
          ovn-nbctl lb-del lb1
          ovn-nbctl --all destroy load_balancer_group
      done
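
      For context, the script assumes it is run from the root of an OVN source tree with a sandbox running (hence the sandbox/ovn-northd.pid path). A sketch of the setup, assuming the in-tree sandbox target and that the script above is saved as reproducer.sh (a name chosen here for illustration):

      # Terminal 1: build and start an OVN sandbox from the source tree.
      make sandbox

      # Terminal 2: run the reproducer from the same tree root. It loops
      # until ovn-northd dies, at which point the kill command fails and
      # "set -e" stops the script.
      ./reproducer.sh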
      

       

       Expected Behavior: Describe what should happen under normal circumstances.

      ovn-northd should not crash.
       

       Observed Behavior: Explain what actually happens.

      ovn-northd crashes when processing the synced datapath for the newly added logical switch. Specifically, sdp->nb_row appears to point to freed or otherwise junk memory: dereferencing sdp->nb_row->table->class_ segfaults. Notably, the garbage table pointer 0x2d333831612d3033 is ASCII text; read as little-endian bytes it spells "30-a183-", which looks like a fragment of a UUID string and hints that the row's memory was freed and reused.
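
      One way to double-check that ASCII interpretation against a core file is to dump the pointer's bytes as characters, e.g. (not run here; the actual bytes are whatever the core contains):

      (gdb) x/8c &sdp->nb_row->table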

       

       Troubleshooting Actions: Outline the steps taken to diagnose or resolve the issue so far.

      I reported this as soon as I realized the reproducer causes the crash. It will be up to the assignee to determine why the crash occurs. I originally suspected that the reproducer's quick deletion and re-adding of the logical switch might cause some confusing IDL messages. However, since the logical switch has a different UUID each time it is re-added, I think this is unlikely. The sketch below shows the general shape of the dangling-pointer bug the evidence suggests.
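
      For illustration only, here is a minimal generic C sketch (not OVN code; all struct names are stand-ins) of the kind of dangling-pointer bug the gdb output below suggests: a long-lived struct caches a raw pointer to a row, the row is freed without the cache being invalidated, and the allocator recycles the block for UUID-like string bytes, so reading the cached row's fields yields ASCII garbage:

      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>

      struct table;                                 /* opaque, like ovsdb_idl_table */
      struct row { const struct table *table; };    /* stands in for ovsdb_idl_row */
      struct cached { const struct row *nb_row; };  /* stands in for ovn_synced_datapath */

      int main(void)
      {
          struct row *r = calloc(1, sizeof *r);
          struct cached c = { .nb_row = r };  /* cache keeps a raw pointer */

          free(r);                            /* row deleted; cache not updated */
          char *s = malloc(sizeof *r);        /* allocator may recycle the block... */
          memcpy(s, "30-a183-", 8);           /* ...for UUID-string bytes */

          /* With a typical allocator the freed block is reused, so
           * c.nb_row->table now reads those string bytes back as a
           * pointer, much like the 0x2d333831612d3033 value in the
           * backtrace below. Dereferencing it is what segfaults. */
          printf("%p\n", (void *) c.nb_row->table);

          free(s);
          return 0;
      }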

       

   Logs: If you collected logs, please provide them (e.g., sos report, /var/log/openvswitch/*, testpmd console).

      The reproducer can be used to get a core file. In the meantime, here is a quick backtrace from the first time I hit the issue:

      (gdb) bt
      #0  0x000000000043294b in en_datapath_synced_logical_switch_run (node=<optimized out>, data=0x2730a0f0) at northd/en-datapath-logical-switch.c:297
      #1  0x000000000045b148 in engine_recompute (node=node@entry=0x733ee0 <en_datapath_synced_logical_switch>, allowed=allowed@entry=true, reason_fmt=reason_fmt@entry=0x61d2c4 "failed handler for input %s") at lib/inc-proc-eng.c:443
      #2  0x000000000045bc8c in engine_compute (node=<optimized out>, recompute_allowed=<optimized out>) at lib/inc-proc-eng.c:486
      #3  engine_run_node (node=0x733ee0 <en_datapath_synced_logical_switch>, recompute_allowed=<optimized out>) at lib/inc-proc-eng.c:545
      #4  engine_run (recompute_allowed=recompute_allowed@entry=true) at lib/inc-proc-eng.c:571
      #5  0x000000000044d98b in inc_proc_northd_run (ovnnb_txn=ovnnb_txn@entry=0x2743c420, ovnsb_txn=ovnsb_txn@entry=0x2740d790, ctx=ctx@entry=0x7fffc314ad40) at northd/inc-proc-northd.c:580
      #6  0x00000000004048b6 in main (argc=<optimized out>, argv=<optimized out>) at northd/ovn-northd.c:1096
      (gdb) list
      292	    synced_logical_switch_map_destroy(switch_map);
      293	    synced_logical_switch_map_init(switch_map);
      294	
      295	    struct ovn_synced_datapath *sdp;
      296	    HMAP_FOR_EACH (sdp, hmap_node, &dps->synced_dps) {
      297	        if (sdp->nb_row->table->class_ != &nbrec_table_logical_switch) {
      298	            continue;
      299	        }
      300	        struct ovn_synced_logical_switch *lsw =
      301	            synced_logical_switch_alloc(sdp);
      (gdb) p sdp
      $1 = (struct ovn_synced_datapath *) 0x27430450
      (gdb) p sdp->nb_row
      $2 = (const struct ovsdb_idl_row *) 0x273fc9f0
      (gdb) p sdp->nb_row->table
      $3 = (struct ovsdb_idl_table *) 0x2d333831612d3033
      (gdb) p sdp->nb_row->table->class_
      Cannot access memory at address 0x2d333831612d3033
      (gdb) p sdp->nb_row->table
      $4 = (struct ovsdb_idl_table *) 0x2d333831612d3033
      (gdb) p *sdp->nb_row->table
      Cannot access memory at address 0x2d333831612d3033
      (gdb) p *sdp->nb_row
      $5 = {hmap_node = {hash = 658485756, next = 0x27422180}, uuid = {parts = {658763392, 0, 658827088, 0}}, src_arcs = {prev = 0x273fca10, next = 0x31}, dst_arcs = {prev = 0x273db95c, next = 0x30342d373331322d}, table = 0x2d333831612d3033, old_datum = 0x3330633137333563, 
        persist_uuid = 97, parsed = 97, reparse_node = {prev = 0x31, next = 0x273db98c}, new_datum = 0x0, prereqs = 0x273fca80, written = 0x0, txn_node = {hash = 0, next = 0x31}, map_op_written = 0x273db8cc, map_op_lists = 0x0, set_op_written = 0x0, set_op_lists = 0x0, 
        change_seqno = {0, 0, 49}, track_node = {prev = 0x273db9bc, next = 0x0}, updated = 0x273fca20, tracked_old_datum = 0x273fca50}
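
      For completeness, if the reproducer is re-run to capture a full core, one way to open it (assuming systemd-coredump is collecting cores; adjust for your core_pattern setup otherwise):

      coredumpctl gdb ovn-northd
      (gdb) bt full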
      

        ovnteam@redhat.com (OVN Team)
        mmichelson (Mark Michelson)
        OVN QE (Inactive)