OCPBUGS-53151 (OpenShift Bugs)

traffic cut after mac binding flows are removed

    • An issue in OVN MAC binding refresh on Azure has been corrected in this release.
    • Bug Fix
    • In Progress

      Description of problem:

      Since https://github.com/ovn-kubernetes/ovn-kubernetes/commit/14e183088f94009bae0cb1d3e7329f72945cecae,
      MAC binding flows that reach the configured max age are removed.

      That commit mentions an "unfortunately" part, which is exactly our problem here: when the MAC binding flows expire, OVS has to execute a controller action to reinstall them. If ovn-controller is slow right after the flows are deleted, the new flows are not installed in time, which impacts dataplane traffic. In practice we see regular DNS traffic cuts when ovn-controller is very slow to update address_sets, which happens often on clusters running ovn-kubernetes with many pods and network policies.

      I think this behavior is "known" (given the "unfortunately" part in the commit).
      On OVN clusters running in a cloud environment like Azure, there is no value in discarding the MAC binding flows: the gateway has a known MAC address that never changes. In particular, the use case addressed by that commit

      > If any endpoint changed MAC address and sent a GARP

      can't happen on Azure.

      Version-Release number of selected component (if applicable):

      OCP 4.14.48

      How reproducible:

      The actual traffic cut is tricky to reproduce but I can provide instructions if useful.

      Steps to Reproduce:

      1. Run a cluster with ovn-kubernetes.

      2. crictl exec -ti `crictl ps -q --name nbdb` ovn-nbctl --columns options list logical_router GR_${HOSTNAME}


      Actual results:
      options : {always_learn_from_arp_request="false", chassis="a9e346ec-62b7-4790-9184-b5a0e8ea00ec", dynamic_neigh_routers="true", lb_force_snat_ip=router_ip, mac_binding_age_threshold="300", snat-ct-zone="0"}
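To check the threshold across several nodes, the value can be pulled out of the options text with a small filter. This is only a sketch: the function name `mac_binding_threshold` is hypothetical, and it assumes the output has the same shape as the one shown above.

```shell
# Hypothetical helper (not part of OVN or ovn-kubernetes): reads the text
# printed by `ovn-nbctl --columns options list logical_router ...` on stdin
# and prints the mac_binding_age_threshold value, or "unset" if it is absent.
mac_binding_threshold() {
  # Capture the digits after mac_binding_age_threshold=, with or without quotes.
  value=$(sed -n 's/.*mac_binding_age_threshold="\{0,1\}\([0-9][0-9]*\)"\{0,1\}.*/\1/p' | head -n 1)
  echo "${value:-unset}"
}

# Example usage on a node, reusing the command from step 2:
#   crictl exec -ti "$(crictl ps -q --name nbdb)" \
#     ovn-nbctl --columns options list logical_router "GR_${HOSTNAME}" \
#     | mac_binding_threshold
```

On the output reported above, this prints the configured threshold; on a router without the option it prints `unset`.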

      Expected results:

      No mac_binding_age_threshold, or mac_binding_age_threshold="0": either as the default on Azure, or as a configurable value.
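For manual testing, the expected state could be approximated by setting the option by hand with the generic `ovn-nbctl set` database command. This is a sketch under assumptions: the function name and the `DRY_RUN` variable are hypothetical, it takes 0 to mean "no aging" as the expected result above implies, and I have not verified whether ovn-kubernetes rewrites the option on its next reconcile.

```shell
# Hypothetical helper: build (and optionally run) the ovn-nbctl command that
# sets mac_binding_age_threshold to 0 on a given gateway router.
# DRY_RUN defaults to 1, so the command is only printed unless DRY_RUN=0.
disable_mac_binding_aging() {
  router="$1"
  set -- ovn-nbctl set logical_router "$router" options:mac_binding_age_threshold=0
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"      # show what would be executed
  else
    "$@"           # must run where ovn-nbctl can reach the NB database
  fi
}

# Example: print the command for this node's gateway router.
#   disable_mac_binding_aging "GR_${HOSTNAME}"
```

The dry-run default is deliberate: on a cluster managed by ovn-kubernetes this should be treated as a diagnostic experiment, not a supported configuration change.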

      Additional info:

      Affected Platforms:

      OCP 4.14 on Azure, customer issue

              sdn-team-bot sdn-team bot
              frigault Francois Rigault