  1. Red Hat OpenStack Services on OpenShift
  2. OSPRH-16179

The OVN cluster falls apart frequently and then shuffles the routers around.


    • Bug
    • Resolution: Done
    • Blocker
    • None
    • rhos-17.1.3
    • openstack-neutron
    • 5
    • False
    • False
    • ?
    • None
    • Hide

      The customer cleaned the MAC_Binding table yesterday evening.

      Current state as of 14:06
      ~~~

      28373 MAC_Bindings
      55948 Logical_Flows

      and

      [root@lpctrl15003 openvswitch]# grep -c transaction\ error ovn-controller.log
      501
      ~~~

      so we've already had 500+ lines like the following since midnight:
      ~~~
      2025-04-25T12:06:28.436Z|397182|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"MAC_Binding\" table to have identical values (lrp-e0a0727b-a539-4e11-a30f-077d4eadaa93 and \"2a03:1e80:a15:52a::1:14d0\") for index on columns \"logical_port\" and \"ip\".  First row, with UUID ed51fc63-dd10-4edd-b390-868cfc786635, was inserted by this transaction.  Second row, with UUID 0dc0f3a7-4908-4114-a1c2-e6aba6a08501, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
      ~~~
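To see whether the constraint violations concentrate on a few router ports or addresses, the errors could be grouped by the (logical_port, ip) pair named in each message. This is a minimal sketch that assumes the log line format shown above; the function name `summarize_mac_binding_conflicts` is hypothetical and not part of any OVN tooling.

```shell
# Group MAC_Binding "transaction error" lines by (logical_port, ip).
# Prints "<count> <logical_port> <ip>", most frequent pair first.
# Assumption: the log uses the message format quoted in this issue.
summarize_mac_binding_conflicts() {
    # $1: path to ovn-controller.log
    grep 'transaction error' "$1" \
      | grep -F 'MAC_Binding' \
      | sed -n 's/.*identical values (\([^ ]*\) and \\"\([^\\]*\)\\").*/\1 \2/p' \
      | sort | uniq -c | sort -rn \
      | awk '{ print $1, $2, $3 }'   # normalize the padding uniq -c adds
}
```

Usage, from the directory that holds the log: `summarize_mac_binding_conflicts ovn-controller.log | head`.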

    • Important

      The OVN cluster falls apart frequently and then shuffles the routers around. Every time that happens, the customers experience downtime.

      The customer faced issues multiple times this week, and currently has a complete connectivity loss for tenants.

      First, we thought that the problem was caused by the rapid growth of the
      MAC_Binding table, and we suggested the following command:

      ~~~

       # ovn-nbctl set logical_router <LR> options:mac_binding_age_threshold=300

      ~~~
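Since the option is set per logical router, applying it to every router means repeating the command once per router name. The helper below is a hedged sketch that only prints the commands so they can be reviewed before anything is changed; `gen_aging_commands` is a hypothetical name, and the router names are assumed to arrive one per line on stdin.

```shell
# Print (do not execute) one "ovn-nbctl set" command per logical router.
# $1: MAC binding age threshold in seconds; router names come from stdin.
gen_aging_commands() {
    threshold=$1
    while IFS= read -r lr; do
        [ -n "$lr" ] || continue   # skip blank lines in the input
        printf 'ovn-nbctl set logical_router %s options:mac_binding_age_threshold=%s\n' \
            "$lr" "$threshold"
    done
}
```

One possible way to feed it, assuming the generic database options `--bare` and `--columns` are available in the installed ovn-nbctl: `ovn-nbctl --bare --columns=name list Logical_Router | gen_aging_commands 300`. The printed commands can then be checked and run by hand.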

      But today the customer had another occurrence of the issue. MAC aging seems
      to work, but MAC_Binding growth does not seem to be the root cause:

       "Monitoring says we were at 137 MAC_Bindings and ~50k Logical Flows at that point in time."

      The OVN logs show many recalculations, many poll operations that took a long
      time, and high CPU usage.
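To put a number on the long poll operations, the intervals could be pulled out of the logs and summarized. This sketch assumes the usual Open vSwitch warning of the form `Unreasonably long <N>ms poll interval`, which is not quoted in this issue, so the pattern should be checked against the actual log first. The function name `long_polls` is hypothetical.

```shell
# Count poll intervals at or above a threshold and report the longest one.
# $1: log file; $2: minimum interval in milliseconds (default 1000).
# Assumption: lines contain "Unreasonably long <N>ms poll interval".
long_polls() {
    min=${2:-1000}
    sed -n 's/.*Unreasonably long \([0-9][0-9]*\)ms poll interval.*/\1/p' "$1" \
      | awk -v min="$min" '
            $1 >= min { n++; if ($1 > max) max = $1 }
            END { printf "count=%d max_ms=%d\n", n, max }'
}
```

For example, `long_polls ovn-controller.log 5000` would report only the intervals of five seconds or more.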

              twilson@redhat.com Terry Wilson
              rhn-engineering-gkadam Ganesh Kadam
              rhos-dfg-networking-squad-neutron