FDP-2070 (Fast Datapath Product)

Test Coverage: ovn-controller can overload itself with packet-ins due to splitting of mc_group flows

    • Task
    • Resolution: Obsolete
    • Major
    • None
    • None
    • ovn24.03
    • False
    • False

      ( ) The test coverage is aligned with the epic's acceptance criteria

      Given a logical switch with large multicast fan-out on a kernel with the RHEL-83440 fix,

      When multicast frames arrive and OVN programs flows without action-list splitting/controller() recirculation,

      Then packets are forwarded purely in the datapath (no controller() packet-ins for steady-state forwarding), and ovn-controller/ovs-vswitchd CPU stays within normal bounds under the same traffic load.

    • rhel-9
    • None

      This task is tracking the test case writing activities to cover the bug described below.

       Problem Description: Clearly explain the issue.

      Since https://github.com/ovn-org/ovn/commit/325c7b2, ovn-controller splits the OpenFlow flows generated for multicast groups (IP multicast, but also MC_FLOOD, MC_UNKNOWN, etc.) into chains of rules whenever the total length of a rule's action list would otherwise exceed MC_OFPACTS_MAX_MSG_SIZE. In effect, it interleaves a controller() action between the other actions.

      This has the unwanted side effect of flooding the controller with these "controller recirculated" packets, which does more harm than dropping the packet because its datapath flow action list is too large.
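
      To make the cost of the split concrete, here is a minimal model of the behavior described above. It is a sketch, not the code from the commit: the function names, the per-action byte sizes, and the limit passed as max_size are all illustrative assumptions (the real value of MC_OFPACTS_MAX_MSG_SIZE is not stated in this issue). The model assumes that every rule in a chain except the last one ends with a controller() hop.

```python
from typing import List


def split_into_chain(action_sizes: List[int], max_size: int) -> List[List[int]]:
    """Greedily pack action sizes (bytes) into rules of at most max_size bytes.

    Simplified model: the size of the controller() action itself is ignored.
    """
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    chunks: List[List[int]] = []
    current: List[int] = []
    used = 0
    for size in action_sizes:
        if size > max_size:
            raise ValueError("a single action is larger than max_size")
        # Start a new rule in the chain when this action would not fit.
        if current and used + size > max_size:
            chunks.append(current)
            current, used = [], 0
        current.append(size)
        used += size
    if current:
        chunks.append(current)
    return chunks


def packet_ins_per_packet(action_sizes: List[int], max_size: int) -> int:
    """Number of controller() round trips one packet takes in this model."""
    chunks = split_into_chain(action_sizes, max_size)
    # Assumption: each rule except the last hands the packet to the controller.
    return max(len(chunks) - 1, 0)
```

      In this model, ten 16-byte actions under a 64-byte limit yield three rules and therefore two controller round trips for every multicast packet, so the packet-in rate grows with both the traffic rate and the fan-out.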

      In the meantime, it has been determined that the underlying problem that caused ovs-vswitchd-generated datapath flows to have overly large action lists was actually a kernel bug (RHEL-83440).

      Now that the original problem has been fixed in the kernel, we should probably revert the OVN commit to avoid the controller DoSing itself.

      This issue has been reported in a couple of cases already:

      https://mail.openvswitch.org/pipermail/ovs-discuss/2025-February/053455.html

      https://issues.redhat.com/browse/OCPBUGS-61000

      Because we still have (at least) layered products using OVN on RHEL 9.2, the OVN revert needs to happen after the kernel fix is ported to RHEL 9.2 (tracked in RHEL-87209).

       Impact Assessment: Describe the severity and impact (e.g., network down, availability of a workaround, etc.).

      Network impact: high CPU usage for ovs-vswitchd and ovn-controller.

       Software Versions: Specify the exact versions in use (e.g., openvswitch3.1-3.1.0-147.el8fdp).

      ovn24.03-24.03.6-26.el9fdp (actually any supported OVN stream)

        Issue Type: Indicate whether this is a new issue or a regression (if a regression, state the last known working version).

      Regression introduced by https://github.com/ovn-org/ovn/commit/325c7b2

       Reproducibility: Confirm if the issue can be reproduced consistently. If not, describe how often it occurs.

      Constant

       Reproduction Steps: Provide detailed steps or scripts to replicate the issue.

      Set up an OVN topology with enough logical switch ports on a single switch that the resulting OpenFlow flows for that switch's multicast groups are split into chains.

      Send multicast traffic.
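
      A possible starting point for scripting the topology step is sketched below. It only builds the ovn-nbctl command lines (ls-add, lsp-add, lsp-set-addresses) and does not run them. The helper name, the port naming scheme, and the MAC numbering are hypothetical, and the number of ports needed to trigger the split is not given in this issue, so num_ports has to be found experimentally. Binding the ports to a chassis and generating the multicast traffic are left out.

```python
from typing import List


def build_topology_commands(switch: str, num_ports: int) -> List[List[str]]:
    """Return ovn-nbctl command lines for one switch with num_ports ports."""
    if not 1 <= num_ports < 2 ** 24:
        raise ValueError("num_ports must be between 1 and 2**24 - 1")
    cmds = [["ovn-nbctl", "ls-add", switch]]
    for i in range(1, num_ports + 1):
        port = f"{switch}-p{i}"  # hypothetical naming scheme
        # Derive a unique unicast MAC from the port index (illustrative only).
        mac = "00:00:00:{:02x}:{:02x}:{:02x}".format(
            (i >> 16) & 0xFF, (i >> 8) & 0xFF, i & 0xFF
        )
        cmds.append(["ovn-nbctl", "lsp-add", switch, port])
        cmds.append(["ovn-nbctl", "lsp-set-addresses", port, mac])
    return cmds
```

      Each returned list can be passed to subprocess.run(cmd, check=True) on a host where ovn-nbctl can reach the northbound database.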

       Expected Behavior: Describe what should happen under normal circumstances.

      Traffic should be forwarded to the destinations (at least as long as the number of destinations does not cause the maximum OVS resubmit limit to be hit).

       Observed Behavior: Explain what actually happens.

      All multicast packets are forwarded through a chain of controller() actions, which causes high CPU usage for both ovs-vswitchd and ovn-controller and impacts the network.
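
      One way a test could detect this condition is to inspect the text printed by ovs-ofctl dump-flows br-int and look for flows whose action list contains controller( and whose n_packets counter is non-zero. The sketch below assumes the usual dump-flows line layout (n_packets=..., actions=...); the function name is hypothetical. Other OVN flows legitimately use controller() actions, so a real test would probably restrict the check to the multicast group flows and compare counters before and after sending traffic.

```python
import re
from typing import List

_N_PACKETS = re.compile(r"\bn_packets=(\d+)")


def controller_flows_with_hits(dump_flows_output: str) -> List[str]:
    """Return flow lines that use controller() and have matched packets."""
    hits: List[str] = []
    for line in dump_flows_output.splitlines():
        # Only look at the action list, not at the match fields.
        _, sep, actions = line.partition("actions=")
        if not sep or "controller(" not in actions:
            continue
        match = _N_PACKETS.search(line)
        if match and int(match.group(1)) > 0:
            hits.append(line.strip())
    return hits
```

      An empty result after sending steady-state multicast traffic would be consistent with the acceptance criterion that packets are forwarded purely in the datapath.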

       Troubleshooting Actions: Outline the steps taken to diagnose or resolve the issue so far.

       

       Logs: If you collected logs please provide them (e.g. sos report, /var/log/openvswitch/* , testpmd console)

              ovnteam@redhat.com OVN Team
              nstbot NST Bot
              OVN QE OVN QE