Fast Datapath Product / FDP-1373

[RAFT] Leadership transfers seem to be a cause of Neutron races

    • Type: Bug
    • Resolution: Can't Do
    • Priority: Major
    • openvswitch3.5
    • Given an OVS deployment with active BFD sessions on multiple chassis and high flow reprogramming activity,

      When leadership changes or recomputations happen under load,

      Then, BFD echo/keepalive packets should still be processed and responded to within the expected interval to avoid false timeouts and unnecessary port failovers.

    • rhel-9
    • rhel-net-ovs-dpdk
    • ssg_networking

       Problem Description: Clearly explain the issue.

      From time to time RAFT transfers leadership to write snapshots:

      2025-04-17T03:51:58.983Z|00046|raft|INFO|Transferring leadership to write a snapshot.
      

      The problem is that this transfer seems to cause interruptions on the Neutron side, as explained in https://issues.redhat.com/browse/OSPRH-14377 and https://issues.redhat.com/browse/OSPRH-16149

      In https://issues.redhat.com/browse/OSPRH-16149, a Neutron port was created during a leadership transfer, and Neutron then failed to bind the port because it did not exist on the OVN side.

       Impact Assessment: Describe the severity and impact (e.g., network down, availability of a workaround, etc.).

      Some instance startups in the RHOSP environment fail. This may complicate automation on the customer's side and force them to implement cleanup and retry logic.
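The cleanup and retry logic mentioned above could look roughly like the sketch below. It is only an illustration of the kind of workaround a customer might need: `retry_with_cleanup`, `operation`, and `cleanup` are hypothetical names and not part of Neutron, OVN, or any customer tooling. The caller is assumed to supply a callable that starts the instance (or binds the port) and another that removes leftovers from a failed attempt.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_cleanup(
    operation: Callable[[], T],
    cleanup: Optional[Callable[[], None]] = None,
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, cleaning up and retrying with exponential backoff.

    All names here are hypothetical; `operation` stands in for whatever
    call starts an instance or binds a port in the caller's automation.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:  # a real caller should catch narrower errors
            last_error = exc
            if cleanup is not None:
                cleanup()  # remove leftovers from the failed attempt
            if attempt < attempts - 1:
                # Wait base_delay, 2*base_delay, 4*base_delay, ... between tries.
                sleep(base_delay * 2 ** attempt)

    assert last_error is not None
    raise last_error
```

The `sleep` parameter exists only so the backoff can be replaced in tests; in normal use the default `time.sleep` applies.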

       

       Software Versions: Specify the exact versions in use (e.g., openvswitch3.1-3.1.0-147.el8fdp).

      openvswitch3.3-3.3.0-49.el9fdp.x86_64
       

        Issue Type: Indicate whether this is a new issue or a regression (if a regression, state the last known working version).

      Likely a new issue
       

       Reproducibility: Confirm if the issue can be reproduced consistently. If not, describe how often it occurs.

      It happens occasionally in the customer's deployment when batches of instances are started at the same time as changes occur on the OVN cluster side.
       

       Reproduction Steps: Provide detailed steps or scripts to replicate the issue.

      Irrelevant
       

       Expected Behavior: Describe what should happen under normal circumstances.

      Neutron operations shouldn't be interrupted by OVN issues. OVN should provide consistent communication with its control plane.
       

       Observed Behavior: Explain what actually happens.

      It looks like the leadership transfer causes communication timeouts.
       

       Troubleshooting Actions: Outline the steps taken to diagnose or resolve the issue so far.

      Compared Neutron error logs from https://issues.redhat.com/browse/OSPRH-16149 with OVN events.
       

       Logs: If you collected logs please provide them (e.g. sos report, /var/log/openvswitch/* , testpmd console)

      The latest set of sosreports contains relevant messages in /var/log/containers/openvswitch/ovsdb-server-nb.log and /var/log/containers/openvswitch/ovsdb-server-sb.log. The issue happened on 2025-04-17 at 04:52 local time (03:52 UTC).
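To line up the OVN events with the Neutron errors, the ovsdb-server logs can be filtered for RAFT leadership transfers close to the incident time. The sketch below assumes the log line layout shown in the problem description (`timestamp|sequence|module|level|message`, with UTC timestamps); the helper names `parse_ovs_timestamp` and `transfers_near` and the two-minute window are illustrative choices, not existing tooling.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Optional, Tuple

TRANSFER_MARKER = "Transferring leadership"


def parse_ovs_timestamp(line: str) -> Optional[datetime]:
    """Parse the leading UTC timestamp of an ovsdb-server log line."""
    stamp = line.split("|", 1)[0].strip()
    try:
        parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    except ValueError:
        return None  # not a timestamped log line
    return parsed.replace(tzinfo=timezone.utc)


def transfers_near(
    lines: Iterable[str],
    incident: datetime,
    window: timedelta = timedelta(minutes=2),
) -> List[Tuple[datetime, str]]:
    """Return raft leadership-transfer lines within `window` of `incident`."""
    hits = []
    for line in lines:
        if "|raft|" not in line or TRANSFER_MARKER not in line:
            continue
        ts = parse_ovs_timestamp(line)
        if ts is not None and abs(ts - incident) <= window:
            hits.append((ts, line.rstrip("\n")))
    return hits


if __name__ == "__main__":
    # Incident time from this report: 2025-04-17 03:52 UTC.
    incident = datetime(2025, 4, 17, 3, 52, tzinfo=timezone.utc)
    sample = [
        "2025-04-17T03:51:58.983Z|00046|raft|INFO|"
        "Transferring leadership to write a snapshot.",
    ]
    for ts, text in transfers_near(sample, incident):
        print(ts.isoformat(), text)
```

In practice, `lines` would be an open file handle for ovsdb-server-nb.log or ovsdb-server-sb.log from the sosreport, and `incident` the UTC timestamp of the Neutron error being investigated.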

              Paolo Valerio (pvalerio@redhat.com)
              Alex Stupnikov (rhn-support-astupnik)
              Jianlin Shi